Principle:EvolvingLMMs Lab Lmms eval PassAtK Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation Metrics, Statistical Analysis |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Pass@k evaluation measures model reliability by generating multiple solutions and computing coverage, majority vote, and average correctness metrics.
Description
Pass@k evaluation addresses the stochastic nature of modern language models by generating multiple independent solutions for each problem and computing aggregate metrics. Instead of relying on a single sample (which may fail due to randomness), pass@k metrics assess whether a model can solve a problem given k attempts. This provides insights into model consistency, confidence (via majority vote), and overall capability. Three complementary metrics are computed: coverage@k (can the model solve it at least once?), majority@k (is the most common answer correct?), and average@k (what fraction of attempts are correct?).
Usage
Apply this principle when evaluating models on difficult tasks where single-sample accuracy is low, assessing model reliability and consistency across multiple runs, comparing different decoding strategies (temperature, top-p), or estimating how many attempts a user would need to get a correct answer.
Theoretical Basis
Metrics Definitions
- Coverage@k (cov@k): Binary per-problem metric indicating whether at least one of the first k samples is correct
  - Formula: cov@k = 1 if any(correct[0:k]) else 0
  - Interpretation: Averaged over problems, the fraction of problems the model solves within k attempts
- Majority@k (maj@k): Binary per-problem metric indicating whether the most common answer among the first k samples is correct
  - Formula: maj@k = 1 if mode(answers[0:k]) == ground_truth else 0
  - Interpretation: Whether the model converges on the correct answer consistently enough for a majority vote to recover it
- Average@k (avg@k): Fraction of the first k samples that are correct
  - Formula: avg@k = sum(correct[0:k]) / k
  - Interpretation: Expected accuracy of a single sample
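The three formulas above can be sketched for a single problem as follows. This is a minimal illustration, not the lmms-eval implementation: the function names are hypothetical, answers are compared by exact string match, and ties in the majority vote are broken in favor of the answer seen first (an assumption of this sketch).

```python
from collections import Counter


def coverage_at_k(correct, k):
    """Return 1 if any of the first k samples is correct, else 0."""
    return int(any(correct[:k]))


def majority_at_k(answers, ground_truth, k):
    """Return 1 if the most common of the first k answers equals ground_truth."""
    # Counter.most_common orders equal counts by first occurrence,
    # so ties go to the answer that appeared earliest (sketch assumption).
    top_answer, _ = Counter(answers[:k]).most_common(1)[0]
    return int(top_answer == ground_truth)


def average_at_k(correct, k):
    """Return the fraction of the first k samples that are correct."""
    return sum(correct[:k]) / k


# Toy data: four sampled answers for one problem.
answers = ["12", "15", "12", "9"]
ground_truth = "12"
correct = [a == ground_truth for a in answers]

cov = coverage_at_k(correct, 4)
maj = majority_at_k(answers, ground_truth, 4)
avg = average_at_k(correct, 4)
```

Dataset-level scores are then typically obtained by averaging each per-problem value over all problems.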
Sampling Strategy
- Generate k samples with temperature > 0 (typically 0.7-1.0)
- Each sample should be independent (no caching between samples)
- Common k values: 1, 2, 4, 8, 16, 32, 64 (powers of 2 by convention; results for every smaller k can be read from a prefix of one run at the largest k)
- Evaluate all k samples before computing metrics
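The sampling steps above could look roughly like the sketch below. The `generate_fn` callable and its `temperature` and `seed` parameters are hypothetical stand-ins for whatever model interface is in use; `toy_generate` is a placeholder that picks a random answer so the example runs without a model.

```python
import random


def sample_k(generate_fn, prompt, k, temperature=0.8):
    """Draw k independent samples for one prompt.

    generate_fn is a hypothetical callable; a distinct seed per call keeps
    the samples independent instead of returning one cached output k times.
    """
    if temperature <= 0:
        raise ValueError("temperature must be > 0 for stochastic sampling")
    return [generate_fn(prompt, temperature=temperature, seed=i) for i in range(k)]


def toy_generate(prompt, temperature, seed):
    """Placeholder model: returns a random answer from a fixed set."""
    rng = random.Random(seed)
    return rng.choice(["12", "15", "9"])


samples = sample_k(toy_generate, "What is 7 + 5?", k=8)
```

All k samples are collected first; the metrics are computed only afterwards, from the stored list.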
Statistical Interpretation
- High cov@k, Low avg@k: Model occasionally finds correct answer but is inconsistent
- High maj@k, High avg@k: Model is confident and correct
- High cov@k, Low maj@k: Model produces the correct answer among diverse outputs, but it is not the most common one
- Low cov@k: Model rarely produces a correct answer even with k attempts
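A toy illustration of these patterns, using made-up answer lists for a single problem with k=4. The helper name `metrics` is hypothetical, and majority ties are broken by first occurrence.

```python
from collections import Counter


def metrics(answers, truth):
    """Compute cov, maj and avg over all answers given for one problem."""
    correct = [a == truth for a in answers]
    mode = Counter(answers).most_common(1)[0][0]
    return {
        "cov": int(any(correct)),
        "maj": int(mode == truth),
        "avg": sum(correct) / len(answers),
    }


# Correct answer dominates: high cov, high maj, high avg.
consistent = metrics(["A", "A", "A", "B"], "A")
# Correct answer appears once, but a wrong answer is the mode.
scattered = metrics(["B", "C", "A", "B"], "A")
# Correct answer never appears.
unsolved = metrics(["B", "C", "B", "D"], "A")
```

Here `scattered` shows the high-coverage, low-majority case: the problem is solvable by the model, but a vote over its samples would still pick the wrong answer.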
Computational Considerations
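- Requires k times the generation compute of single-sample evaluation
- Sample generation can be parallelized for efficiency
- Metrics for every smaller k can be computed from the first samples of a single run at the largest k (for example @2, @4, and @8 from one 8-sample run), so no extra generation is needed; powers of 2 are a convention, not a requirement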
- Storage grows linearly with k (must save all samples)
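Reusing one set of stored samples for several k values can be sketched as below. The function name and the default `ks` tuple are illustrative choices, not part of any stated API; each k simply takes the first k stored answers.

```python
from collections import Counter


def metrics_for_all_k(answers, ground_truth, ks=(1, 2, 4, 8)):
    """Compute cov@k, maj@k and avg@k for each k from one list of samples."""
    results = {}
    for k in ks:
        if k > len(answers):
            continue  # not enough stored samples for this k
        prefix = answers[:k]
        correct = [a == ground_truth for a in prefix]
        # Ties in the vote go to the answer seen first (sketch assumption).
        mode = Counter(prefix).most_common(1)[0][0]
        results[k] = {
            "cov": int(any(correct)),
            "maj": int(mode == ground_truth),
            "avg": sum(correct) / k,
        }
    return results


# Eight stored samples for one problem, generated once.
answers = ["9", "12", "12", "15", "12", "12", "9", "12"]
table = metrics_for_all_k(answers, "12")
```

Because every k reads a prefix of the same list, the generation cost is paid once, at the largest k.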
Best Practices
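- Use a large sample budget (k=64 or k=100) when reliable pass@k estimates are needed
- Report multiple k values to show how the metrics scale with the number of attempts
- Report standard single-sample evaluation (low temperature) separately from pass@k evaluation (higher temperature), since the two settings are not directly comparable
- Weigh cost against benefit: pass@10 is often sufficient for practical insights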
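One way to reconcile a large sample budget with a smaller reported k is the combinatorial pass@k estimator widely used for code-generation benchmarks: draw n samples, count the c correct ones, and estimate pass@k = 1 - C(n-c, k) / C(n, k). Whether lmms-eval uses this estimator is not stated here, so the sketch below is an assumption offered for illustration; the function name is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 4 samples, 1 correct, estimate for k = 2.
estimate = pass_at_k(4, 1, 2)
```

Averaging over all size-k subsets of the n samples gives a lower-variance estimate than scoring only the first k samples.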
Relationship to Other Metrics
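- Pass@k vs Greedy Accuracy: Greedy decoding (temperature=0) scores a single answer per problem; pass@k scores k answers sampled at temperature>0
- Pass@k vs Beam Search: Beam search keeps the highest-scoring partial sequences, so its candidates tend to be closely related; pass@k draws its samples independently
- Pass@k vs Self-Consistency: Self-consistency aggregates sampled reasoning paths, typically by majority vote, to improve a single final answer; pass@k metrics instead measure how the samples themselves perform, with maj@k scoring the outcome of such a vote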