Principle:EvolvingLMMs Lab Lmms eval PassAtK Evaluation

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Evaluation Metrics, Statistical Analysis
Last Updated	2026-02-14 00:00 GMT

Overview

Pass@k evaluation measures model reliability by generating multiple solutions and computing coverage, majority vote, and average correctness metrics.

Description

Pass@k evaluation addresses the stochastic nature of modern language models by generating multiple independent solutions for each problem and computing aggregate metrics. Instead of relying on a single sample (which may fail due to randomness), pass@k metrics assess whether a model can solve a problem given k attempts. This provides insights into model consistency, confidence (via majority vote), and overall capability. Three complementary metrics are computed: coverage@k (can the model solve it at least once?), majority@k (is the most common answer correct?), and average@k (what fraction of attempts are correct?).

Usage

Apply this principle when evaluating models on difficult tasks where single-sample accuracy is low, assessing model reliability and consistency across multiple runs, comparing different decoding strategies (temperature, top-p), or estimating how many attempts a user would need to get a correct answer.

Theoretical Basis

Metrics Definitions

Coverage@k (cov@k): Binary metric indicating if at least one of the first k samples is correct
- Formula: cov@k = 1 if any(correct[0:k]) else 0
- Interpretation: Whether the problem is solved within k attempts; averaged over problems, it estimates the probability of success with k attempts

Majority@k (maj@k): Binary metric indicating if the most common answer in the first k samples is correct (ties between equally frequent answers need a fixed tie-breaking rule)
- Formula: maj@k = 1 if mode(answers[0:k]) == ground_truth else 0
- Interpretation: Confidence and consistency of the model

Average@k (avg@k): Fraction of the first k samples that are correct
- Formula: avg@k = sum(correct[0:k]) / k
- Interpretation: Expected accuracy per sample
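The three formulas above can be sketched directly in Python. This is a minimal illustration rather than the lmms-eval implementation: the function names are hypothetical, answers are assumed to be already normalized so that equality comparison is meaningful, and k is assumed not to exceed the number of samples.

```python
from collections import Counter
from typing import Hashable, Sequence


def coverage_at_k(correct: Sequence[bool], k: int) -> int:
    """Return 1 if any of the first k samples is correct, else 0."""
    return int(any(correct[:k]))


def majority_at_k(answers: Sequence[Hashable], ground_truth: Hashable, k: int) -> int:
    """Return 1 if the most common answer among the first k samples is correct."""
    # Counter.most_common orders equal counts by first occurrence, so a tie
    # goes to the answer that appeared earliest (an assumed tie-breaking rule).
    top_answer, _ = Counter(answers[:k]).most_common(1)[0]
    return int(top_answer == ground_truth)


def average_at_k(correct: Sequence[bool], k: int) -> float:
    """Return the fraction of the first k samples that are correct."""
    return sum(correct[:k]) / k


# Toy data: four sampled answers for one problem whose ground truth is "12".
answers = ["12", "15", "12", "9"]
ground_truth = "12"
correct = [a == ground_truth for a in answers]

cov = coverage_at_k(correct, 4)
maj = majority_at_k(answers, ground_truth, 4)
avg = average_at_k(correct, 4)
```

Dataset-level scores would then be the mean of each per-problem value across all problems.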

Sampling Strategy

Generate k samples with temperature > 0 (typically 0.7-1.0)
Each sample should be independent (no caching between samples)
Common k values: 1, 2, 4, 8, 16, 32, 64 (powers of 2 cover a wide range with few reporting points)
Evaluate all k samples before computing metrics
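The sampling steps above can be sketched as a small loop. Everything here is an assumption for illustration: `generate` stands in for whatever model call is actually used, its `(prompt, temperature, seed)` signature is hypothetical, and `toy_generate` is a stub that only mimics stochastic output.

```python
import random
from typing import Callable, List


def sample_k(
    generate: Callable[[str, float, int], str],
    prompt: str,
    k: int,
    temperature: float = 0.8,
    base_seed: int = 0,
) -> List[str]:
    """Draw k samples for one prompt, each with its own seed."""
    if temperature <= 0:
        # Greedy decoding would return the same answer k times.
        raise ValueError("temperature must be > 0 for pass@k sampling")
    # A distinct seed per call keeps the samples independent of each other.
    return [generate(prompt, temperature, base_seed + i) for i in range(k)]


def toy_generate(prompt: str, temperature: float, seed: int) -> str:
    """Hypothetical stand-in for a model: ignores prompt and temperature."""
    rng = random.Random(seed)
    return rng.choice(["12", "15", "9"])


samples = sample_k(toy_generate, "What is 7 + 5?", k=8)
```

Collecting all k outputs in a list before scoring matches the last step above: metrics are computed only once every sample is available.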

Statistical Interpretation

High cov@k, Low avg@k: Model occasionally finds correct answer but is inconsistent
High maj@k, High avg@k: Model is confident and correct
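High cov@k, Low maj@k: Model produces the correct answer at least once, but it is not the most frequent answer (diverse outputs or a dominant wrong answer)
Low cov@k: Model rarely or never produces a correct answer within k attempts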
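A few toy sample sets make these patterns concrete. The data below is invented for illustration, the helper name `metrics` is hypothetical, and ties in the majority vote are broken by first occurrence, which is an assumption.

```python
from collections import Counter
from typing import Dict, Sequence


def metrics(answers: Sequence[str], ground_truth: str) -> Dict[str, float]:
    """Compute cov, maj and avg over all given samples for one problem."""
    correct = [a == ground_truth for a in answers]
    top_answer, _ = Counter(answers).most_common(1)[0]
    return {
        "cov": int(any(correct)),
        "maj": int(top_answer == ground_truth),
        "avg": sum(correct) / len(answers),
    }


truth = "12"

# One lucky hit among a dominant wrong answer: covered but inconsistent.
inconsistent = metrics(["7", "7", "7", "12", "7", "3", "7", "7"], truth)

# Nearly every sample agrees on the right answer: confident and correct.
consistent = metrics(["12"] * 7 + ["7"], truth)

# Many different answers, the correct one appears once: no strong correct mode.
scattered = metrics(["12", "7", "3", "9", "5", "7", "1", "8"], truth)

# The correct answer never appears.
unsolved = metrics(["7", "3", "7", "9"], truth)
```

Per-problem values like these are binary or coarse, so "high" and "low" in the list above are best read as statements about their averages over a dataset.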

Computational Considerations

Requires k times more compute than single-sample evaluation
Can parallelize sample generation for efficiency
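Samples from a single run at the largest k can be reused: metrics at any smaller k (@2, @4, @8, ...) are computed from the first k samples, so nothing is regenerated; powers of 2 are a convention, not a requirement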
Storage grows linearly with k (must save all samples)
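Reusing one run for several k values amounts to scoring prefixes of the stored samples. The sketch below assumes the samples are kept in generation order; the function name `metrics_over_k` and the first-occurrence tie-breaking are illustrative choices, not taken from lmms-eval.

```python
from collections import Counter
from typing import Dict, Sequence


def metrics_over_k(
    answers: Sequence[str], ground_truth: str, ks: Sequence[int]
) -> Dict[int, Dict[str, float]]:
    """Score the first k stored samples for each requested k."""
    results: Dict[int, Dict[str, float]] = {}
    for k in ks:
        if not 1 <= k <= len(answers):
            raise ValueError(f"k={k} is outside 1..{len(answers)}")
        prefix = answers[:k]  # no new generation, just a slice
        correct = [a == ground_truth for a in prefix]
        top_answer, _ = Counter(prefix).most_common(1)[0]
        results[k] = {
            "cov": int(any(correct)),
            "maj": int(top_answer == ground_truth),
            "avg": sum(correct) / k,
        }
    return results


# Eight stored samples for one problem; every smaller k is a prefix of them.
stored = ["7", "12", "12", "7", "12", "12", "3", "12"]
table = metrics_over_k(stored, "12", ks=[1, 2, 4, 8])
```

Because each k only slices the stored list, the extra cost of reporting more k values is negligible next to generation.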

Best Practices

Use a large sample budget (for example k=64 or k=100) when reliable pass@k estimates are needed
Report multiple k values to show scaling behavior
Report single-sample accuracy (greedy or low temperature) separately from pass@k results (higher temperature)
Consider cost-effectiveness: pass@10 is often sufficient for practical insights
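When more samples are generated than the k being reported, one widely used way to get a lower-variance estimate is the unbiased pass@k estimator from Chen et al. (2021), "Evaluating Large Language Models Trained on Code": with n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below is offered as a related technique, not as the method this page's metrics use; the prefix-based cov@k defined earlier is a different quantity.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct (Chen et al., 2021)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets containing no correct sample;
    # it is 0 whenever fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 4 samples, 1 correct.
estimate_at_1 = pass_at_k(n=4, c=1, k=1)
estimate_at_2 = pass_at_k(n=4, c=1, k=2)
```

Averaging this value over problems gives a dataset-level estimate that does not depend on the order in which samples were drawn.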

Relationship to Other Metrics

Pass@k vs Greedy Accuracy: Greedy uses temperature=0, pass@k uses temperature>0
Pass@k vs Beam Search: Beam search explores related paths, pass@k samples independently
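Pass@k vs Self-Consistency: Self-consistency aggregates sampled reasoning paths by majority vote into a single final answer, which is what maj@k scores; cov@k and avg@k score the individual samples instead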
