Principle:EvolvingLMMs Lab Lmms eval PassAtK Evaluation

From Leeroopedia
Knowledge Sources
Domains Evaluation Metrics, Statistical Analysis
Last Updated 2026-02-14 00:00 GMT

Overview

Pass@k evaluation measures model reliability by generating multiple solutions and computing coverage, majority vote, and average correctness metrics.

Description

Pass@k evaluation addresses the stochastic nature of modern language models by generating multiple independent solutions for each problem and computing aggregate metrics. Instead of relying on a single sample (which may fail due to randomness), pass@k metrics assess whether a model can solve a problem given k attempts. This provides insights into model consistency, confidence (via majority vote), and overall capability. Three complementary metrics are computed: coverage@k (can the model solve it at least once?), majority@k (is the most common answer correct?), and average@k (what fraction of attempts are correct?).

Usage

Apply this principle when evaluating models on difficult tasks where single-sample accuracy is low, assessing model reliability and consistency across multiple runs, comparing different decoding strategies (temperature, top-p), or estimating how many attempts a user would need to get a correct answer.

Theoretical Basis

Metrics Definitions

  • Coverage@k (cov@k): Binary metric indicating if at least one of the first k samples is correct
    • Formula: cov@k = 1 if any(correct[0:k]) else 0
    • Interpretation: Binary for a single problem; averaged over a dataset, it gives the fraction of problems solved within k attempts
  • Majority@k (maj@k): Binary metric indicating if the most common answer in first k samples is correct
    • Formula: maj@k = 1 if mode(answers[0:k]) == ground_truth else 0
    • Interpretation: Whether the model's most frequent answer is the correct one, a signal of its confidence and consistency
  • Average@k (avg@k): Fraction of first k samples that are correct
    • Formula: avg@k = sum(correct[0:k]) / k
    • Interpretation: Expected accuracy per sample
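
The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the lmms-eval implementation: the function name pass_at_k_metrics is hypothetical, correctness is assumed to be an exact string match against the ground truth, and ties for the most common answer are broken in favor of the answer that appeared first.

```python
from collections import Counter
from typing import Dict, Sequence


def pass_at_k_metrics(answers: Sequence[str], ground_truth: str, k: int) -> Dict[str, float]:
    """Compute cov@k, maj@k, and avg@k for one problem from its first k answers.

    Assumption: an answer is correct when it equals ground_truth exactly.
    """
    if not 1 <= k <= len(answers):
        raise ValueError("k must be between 1 and the number of samples")

    first_k = list(answers[:k])
    correct = [a == ground_truth for a in first_k]

    # Most common answer among the first k samples; on a tie, Counter returns
    # the answer that was encountered first (an arbitrary tie-breaking choice).
    majority_answer = Counter(first_k).most_common(1)[0][0]

    return {
        "cov": 1.0 if any(correct) else 0.0,
        "maj": 1.0 if majority_answer == ground_truth else 0.0,
        "avg": sum(correct) / k,
    }


# Four samples for one problem whose ground truth is "12".
metrics = pass_at_k_metrics(["12", "15", "12", "12"], ground_truth="12", k=4)
print(metrics)  # {'cov': 1.0, 'maj': 1.0, 'avg': 0.75}
```

In practice the equality check would likely be replaced by a task-specific answer normalizer or judge.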

Sampling Strategy

  • Generate k samples with temperature > 0 (typically 0.7-1.0)
  • Each sample should be independent (no caching between samples)
  • Common k values: 1, 2, 4, 8, 16, 32, 64 (powers of 2 are a convention; metrics at any smaller k can be computed from the first k samples of a larger run)
  • Evaluate all k samples before computing metrics
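
A sampling loop following these points might look like the sketch below. Everything here is an assumption for illustration: the generate callable with a (prompt, temperature, seed) signature is hypothetical, and toy_generate merely stands in for a real model call. The point is that each of the k calls gets its own seed and no output is reused.

```python
import random
from typing import Callable, List

# Hypothetical model interface: (prompt, temperature, seed) -> answer text.
GenerateFn = Callable[[str, float, int], str]


def sample_k(generate: GenerateFn, prompt: str, k: int,
             temperature: float = 0.8, base_seed: int = 0) -> List[str]:
    """Draw k samples, giving each call a distinct seed so no output is reused."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if temperature <= 0:
        raise ValueError("temperature must be > 0 for stochastic sampling")
    return [generate(prompt, temperature, base_seed + i) for i in range(k)]


def toy_generate(prompt: str, temperature: float, seed: int) -> str:
    """Stand-in for a real model: picks a random answer from a fixed pool."""
    rng = random.Random(seed)
    return rng.choice(["4", "4", "5"])


# All k samples are collected first; metrics are computed afterwards.
samples = sample_k(toy_generate, "What is 2 + 2?", k=8)
print(len(samples))  # 8
```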

Statistical Interpretation

  • High cov@k, Low avg@k: Model occasionally finds correct answer but is inconsistent
  • High maj@k, High avg@k: Model is confident and correct
  • High cov@k, Low maj@k: Model generates diverse answers without strong mode
  • Low cov@k: Model rarely or never produces a correct answer within k attempts
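
As a rough illustration, the patterns above can be turned into a small lookup over dataset-level metric values. The function name interpret, the 0.5 cutoff between "low" and "high", and the order in which overlapping patterns are checked are all arbitrary assumptions, not part of the principle.

```python
def interpret(cov: float, maj: float, avg: float, threshold: float = 0.5) -> str:
    """Map dataset-level cov@k, maj@k, avg@k (each in [0, 1]) to a rough reading.

    Assumption: a value counts as "high" when it is >= threshold.
    """
    if cov < threshold:
        return "rarely solved within k attempts"
    if maj >= threshold and avg >= threshold:
        return "consistent and correct"
    if maj < threshold:
        return "diverse answers without a strong correct mode"
    # Remaining case: high cov and maj, but low avg.
    return "occasionally correct but inconsistent"


print(interpret(cov=0.9, maj=0.2, avg=0.3))
# diverse answers without a strong correct mode
```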

Computational Considerations

  • Requires k times more compute than single-sample evaluation
  • Can parallelize sample generation for efficiency
  • Sampling once at the largest k allows computing metrics at every smaller k (@2, @4, @8, ...) from prefixes of the same samples; powers of 2 are a common convention, not a requirement
  • Storage grows linearly with k (must save all samples)
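
Reusing one set of samples for several k values can be sketched as a single pass that keeps running counts and emits metrics whenever the prefix length reaches a requested k. The function name metrics_at_all_k is hypothetical, and exact string match is again assumed as the correctness check.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def metrics_at_all_k(answers: Sequence[str], ground_truth: str,
                     ks: Iterable[int]) -> Dict[int, Dict[str, float]]:
    """Compute cov@k, maj@k, avg@k for each k in ks from prefixes of one sample list."""
    wanted = set(ks)
    counts: Counter = Counter()  # running answer frequencies
    num_correct = 0              # running count of correct answers
    results: Dict[int, Dict[str, float]] = {}

    for i, answer in enumerate(answers, start=1):
        counts[answer] += 1
        num_correct += answer == ground_truth
        if i in wanted:
            mode = counts.most_common(1)[0][0]  # ties go to the earliest-seen answer
            results[i] = {
                "cov": float(num_correct > 0),
                "maj": float(mode == ground_truth),
                "avg": num_correct / i,
            }
    return results


answers = ["3", "3", "7", "3", "7", "7", "7", "7"]
for k, m in metrics_at_all_k(answers, ground_truth="7", ks=[1, 2, 4, 8]).items():
    print(k, m)
# 1 {'cov': 0.0, 'maj': 0.0, 'avg': 0.0}
# 2 {'cov': 0.0, 'maj': 0.0, 'avg': 0.0}
# 4 {'cov': 1.0, 'maj': 0.0, 'avg': 0.25}
# 8 {'cov': 1.0, 'maj': 1.0, 'avg': 0.625}
```

Any k larger than the number of stored samples is silently skipped in this sketch, so the caller needs to generate at least max(ks) samples.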

Best Practices

  • Use k=64 or k=100 for reliable pass@k estimates
  • Report multiple k values to show scaling behavior
  • Report greedy or low-temperature accuracy separately from pass@k results, which are obtained at a higher sampling temperature
  • Consider cost-effectiveness: pass@10 often sufficient for practical insights

Relationship to Other Metrics

  • Pass@k vs Greedy Accuracy: Greedy uses temperature=0, pass@k uses temperature>0
  • Pass@k vs Beam Search: Beam search explores related paths, pass@k samples independently
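  • Pass@k vs Self-Consistency: Self-consistency aggregates multiple sampled reasoning paths into a single final answer by voting (closely related to maj@k), while pass@k metrics score the set of samples themselves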
