Principle: EvolvingLMMs-Lab lmms-eval Post-Processing and Metrics
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Post-processing and metrics is the stage where raw model outputs are filtered, normalized, scored against ground truth, and aggregated into benchmark-level statistics with standard error estimates.
Description
After model inference completes, the raw responses must be processed through several stages before final scores can be computed. This principle covers the entire pipeline from raw model outputs to aggregated benchmark metrics.
The pipeline consists of three phases:
Phase 1 -- Filtering:
Filter ensembles are applied to the raw model responses via task.apply_filters(). The default filter is take_first, which selects the first response when multiple responses exist (from repetition). Custom filter pipelines can be configured in task YAML files to perform operations like regex extraction, normalization, or response selection.
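The behavior of a take_first-style filter can be sketched as below; the function name and data shape are illustrative, not the library's exact API:

```python
# Illustrative sketch of a take_first-style filter: for each document,
# keep only the first of possibly many repeated responses.
def take_first(responses_per_doc):
    """responses_per_doc: one list of responses per document."""
    return [responses[0] for responses in responses_per_doc]

# Example: two documents, each answered twice (e.g. repeats=2).
raw = [["Paris", "paris"], ["4", "four"]]
print(take_first(raw))  # ['Paris', '4']
```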
Phase 2 -- Per-Document Scoring:
The process_results(doc, results) method scores each document against its ground truth. The scoring logic depends on the output type:
- generate_until -- Applies each metric function from the task's metric_list to the generated text vs. the reference text. Supports multiple references and metrics such as exact match, BLEU, ANLS, and custom functions.
- multiple_choice -- Computes accuracy by comparing the argmax of the log-likelihoods against the gold label. Supports normalized accuracy (by completion length) and mutual information scoring.
- loglikelihood -- Returns accuracy (is the greedy completion correct?) and perplexity.
Tasks can also define a custom process_results function via their YAML configuration, which is invoked with the document and results as arguments.
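A minimal sketch of what such a custom process_results function might look like for a generate_until task scored by exact match; the document key ("answer") and the normalization steps are assumptions, not the library's fixed contract:

```python
# Hypothetical process_results for a generate_until task scored by
# exact match. Returns a dict mapping metric name -> per-document score.
def process_results(doc, results):
    prediction = results[0].strip().lower()
    reference = doc["answer"].strip().lower()
    return {"exact_match": 1.0 if prediction == reference else 0.0}

doc = {"question": "2+2?", "answer": "4"}
print(process_results(doc, [" 4 "]))  # {'exact_match': 1.0}
```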
Phase 3 -- Aggregation:
Per-document scores are aggregated across the evaluation split using registered aggregation functions:
- mean -- Simple arithmetic mean (used for accuracy, exact match).
- perplexity -- Exponential of negative mean log-likelihood.
- bleu/chrf/ter -- Corpus-level machine translation metrics.
- median, f1, matthews_corrcoef -- Other statistical aggregations.
Standard error is computed via bootstrap resampling (default 100,000 iterations) for metrics where analytical stderr is not available.
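The two simplest aggregations above can be sketched as plain functions (signatures assumed for illustration):

```python
import math

# Sketches of two of the aggregation functions listed above.
def mean(scores):
    # Simple arithmetic mean, e.g. for accuracy or exact match.
    return sum(scores) / len(scores)

def perplexity(loglikelihoods):
    # Exponential of the negative mean log-likelihood.
    return math.exp(-mean(loglikelihoods))

print(mean([1.0, 0.0, 1.0, 1.0]))  # 0.75
print(perplexity([-0.5, -1.5]))    # 2.718281828459045
```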
Usage
Use post-processing and metrics whenever:
- You are computing final benchmark scores from model outputs.
- You need to define custom metrics for a new task.
- You want to understand how standard errors are computed for reported scores.
- You are debugging unexpected metric values by inspecting per-document scores.
Theoretical Basis
Metric Registration:
Metrics are registered in a global registry using decorators:
```python
@register_metric(
    metric="exact_match",
    higher_is_better=True,
    output_type="generate_until",
    aggregation="mean",
)
```
Each registration binds a metric name to a scoring function, an aggregation function, and metadata about directionality.
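A minimal sketch of how such a decorator-based registry can work; the registry structure here is assumed for illustration, not the library's actual implementation:

```python
# Hypothetical global registry binding a metric name to its scoring
# function, aggregation name, and directionality metadata.
METRIC_REGISTRY = {}

def register_metric(metric, higher_is_better, output_type, aggregation):
    def decorate(fn):
        METRIC_REGISTRY[metric] = {
            "fn": fn,
            "higher_is_better": higher_is_better,
            "output_type": output_type,
            "aggregation": aggregation,
        }
        return fn
    return decorate

@register_metric(
    metric="exact_match",
    higher_is_better=True,
    output_type="generate_until",
    aggregation="mean",
)
def exact_match(prediction, reference):
    return 1.0 if prediction == reference else 0.0

print(METRIC_REGISTRY["exact_match"]["aggregation"])  # mean
```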
Bootstrap Standard Error:
For metrics that do not have a closed-form standard error, the framework uses bootstrap resampling:
Given scores S = [s_1, ..., s_n] and aggregation function f:
- For i = 1, ..., B iterations (default B = 100000):
  - Draw a bootstrap sample S*_i of size n with replacement from S.
  - Compute theta*_i = f(S*_i).
- Return stderr = sample_stddev([theta*_1, ..., theta*_B]).
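The procedure above can be sketched as a short runnable function (B reduced from the default 100,000 for brevity; this is an illustration, not the library's implementation):

```python
import random
import statistics

# Bootstrap stderr: resample scores with replacement B times, apply the
# aggregation f to each resample, and return the stddev of the estimates.
def bootstrap_stderr(f, scores, B=2000, seed=0):
    rng = random.Random(seed)
    n = len(scores)
    thetas = [
        f([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(B)
    ]
    return statistics.stdev(thetas)

scores = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
est = bootstrap_stderr(statistics.mean, scores)
# est should be close to the analytical stderr stdev(scores)/sqrt(n) ~ 0.15
```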
For simple mean aggregation, the analytical formula is used instead:
stderr = sample_stddev(S) / sqrt(n)
Clustered Standard Error:
For benchmarks where multiple questions share the same context (e.g., multiple questions about the same image), the framework supports clustered standard error estimation following the formula:
SE_clustered = sqrt(SE_CLT^2 + (1/n^2) * sum_c sum_i sum_{j!=i} (s_ic - s_bar)(s_jc - s_bar))
This accounts for within-cluster correlation and provides more honest uncertainty estimates.
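The clustered formula can be sketched directly from its terms; scores are grouped per cluster (e.g. all questions about one image), and variable names are illustrative:

```python
import math

# Clustered stderr: analytical CLT term plus within-cluster cross terms
# sum_c sum_i sum_{j!=i} (s_ic - s_bar)(s_jc - s_bar) / n^2.
def clustered_stderr(clusters):
    scores = [s for c in clusters for s in c]
    n = len(scores)
    s_bar = sum(scores) / n
    # SE_CLT^2 = sample variance / n.
    var = sum((s - s_bar) ** 2 for s in scores) / (n - 1)
    se_clt_sq = var / n
    # For each cluster, sum over pairs i != j of deviation products,
    # using the identity sum_{i!=j} d_i d_j = (sum d)^2 - sum d^2.
    cross = 0.0
    for c in clusters:
        dev = [s - s_bar for s in c]
        total = sum(dev)
        cross += total * total - sum(d * d for d in dev)
    return math.sqrt(se_clt_sq + cross / (n ** 2))
```

With singleton clusters the cross terms vanish and the result reduces to the analytical stderr of the mean; positively correlated scores within a cluster inflate the estimate, as intended.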