Principle: EvolvingLMMs-Lab lmms-eval Metric Aggregation
| Field | Value |
|---|---|
| Knowledge Sources | |
| Domains | Metrics, Distributed_Computing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
The final step in distributed evaluation consolidates per-task metrics from the gathered results into final scores with statistical uncertainty estimates, turning raw per-document scores into reportable benchmark numbers.
Description
After result gathering brings all per-rank data onto rank 0, the raw per-document scores must be aggregated into dataset-level metrics. This aggregation process has two levels:
Task-level aggregation computes a single metric value for each task by applying an aggregation function (typically mean, but potentially any custom function) to the list of per-document scores. For each metric, a standard error estimate is computed using bootstrap resampling to quantify uncertainty.
Group-level aggregation combines metrics from multiple related tasks (subtasks) into a single group score. For example, a benchmark like MMMU may consist of multiple subject categories, and the group-level score is the (possibly weighted) average across categories. Group-level standard errors are computed using pooled sample standard error formulas.
The aggregation also handles:
- Multiple metrics per task -- A task may report accuracy, BLEU, exact match, and other metrics simultaneously. Each is aggregated independently.
- Filter keys -- Results can be filtered (e.g., by answer extraction strategy), and aggregation is performed per (metric, filter) pair.
- Higher-is-better tracking -- Each metric is annotated with whether higher values indicate better performance, enabling correct comparison and ranking.
- Stability metrics -- When running in k-samples mode, additional metrics like expected accuracy, consensus accuracy, internal variance, and consistency rate are computed and aggregated.
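The per-(metric, filter) bookkeeping described above can be sketched as follows. This is a minimal illustration, not the lmms-eval implementation: the `per_doc` record layout and field names are assumptions for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-document results: each entry carries a metric name,
# a filter key (e.g. an answer-extraction strategy), and a score.
per_doc = [
    {"metric": "acc", "filter": "none", "score": 1.0},
    {"metric": "acc", "filter": "none", "score": 0.0},
    {"metric": "acc", "filter": "regex", "score": 1.0},
    {"metric": "exact_match", "filter": "none", "score": 1.0},
]

# Bucket scores per (metric, filter) pair, then aggregate each
# bucket independently with the aggregation function (here, mean).
buckets = defaultdict(list)
for r in per_doc:
    buckets[(r["metric"], r["filter"])].append(r["score"])

aggregated = {key: mean(scores) for key, scores in buckets.items()}
```

Each (metric, filter) pair thus yields its own dataset-level value, matching the "aggregation is performed per (metric, filter) pair" behavior.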
Usage
Metric aggregation is performed exclusively on rank 0 after result gathering. Other ranks do not participate in this phase and return empty result dictionaries. The consolidated results are used for:
- Displaying evaluation tables in the terminal
- Saving results to JSON files
- Logging to experiment trackers (e.g., Weights & Biases)
- Comparing model performance across benchmarks
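The rank-0-only behavior can be sketched with a simple guard. The function name and signature here are hypothetical, used only to illustrate that non-zero ranks skip aggregation and return empty dictionaries.

```python
from statistics import mean

def consolidate_results(rank, gathered_scores, agg_fn=mean):
    # Only rank 0 performs aggregation after result gathering;
    # every other rank returns an empty result dictionary.
    if rank != 0:
        return {}
    return {task: agg_fn(scores) for task, scores in gathered_scores.items()}
```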
Theoretical Basis
Task-level aggregation applies an aggregation function to per-document scores:
Given scores S = [s_1, s_2, ..., s_N] for a metric m:
Aggregate value:
agg(m) = f(S) where f is typically mean: f(S) = (1/N) * sum(S)
Bootstrap standard error (B iterations):
For b = 1 to B:
S_b = sample with replacement from S (size N)
agg_b = f(S_b)
stderr(m) = std([agg_1, agg_2, ..., agg_B])
The default number of bootstrap iterations is 100,000, providing precise uncertainty estimates. For computationally expensive metrics (BLEU, chrF, TER), the number is reduced to 100.
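The bootstrap procedure above can be written directly in Python. This is a self-contained sketch (the function name is not from lmms-eval); the iteration count is kept small here, whereas the text notes defaults of 100,000 or 100 depending on metric cost.

```python
import random
from statistics import mean, stdev

def bootstrap_stderr(scores, agg_fn=mean, iters=1000, seed=0):
    # Resample the score list with replacement, re-aggregate each
    # resample, and report the standard deviation of the bootstrap
    # distribution as the standard error estimate.
    rng = random.Random(seed)
    n = len(scores)
    estimates = [
        agg_fn([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(iters)
    ]
    return stdev(estimates)
```

For 100 binary scores with mean 0.5, the analytic standard error is sqrt(0.5 * 0.5 / 100) = 0.05, and the bootstrap estimate lands close to that value.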
Group-level aggregation uses either simple or weighted averaging:
Given subtask metrics M = [m_1, m_2, ..., m_K] with sample sizes N = [n_1, ..., n_K]:
Unweighted mean:
group_metric = (1/K) * sum(M)
Weighted mean (by sample size):
group_metric = sum(m_i * n_i) / sum(n_i)
Pooled standard error:
group_stderr = sqrt(sum(stderr_i^2 * n_i^2)) / sum(n_i)
The choice between weighted and unweighted averaging is controlled by the group's configuration (weight_by_size parameter).
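The group-level formulas can be combined into one small helper. This is an illustrative sketch mirroring the equations above (the function name is an assumption), with `weight_by_size` switching between the unweighted and sample-size-weighted means.

```python
import math

def group_aggregate(metrics, stderrs, sizes, weight_by_size=True):
    # Combine K subtask metrics into a single group score.
    if weight_by_size:
        # Weighted mean: sum(m_i * n_i) / sum(n_i)
        value = sum(m * n for m, n in zip(metrics, sizes)) / sum(sizes)
    else:
        # Unweighted mean: (1/K) * sum(M)
        value = sum(metrics) / len(metrics)
    # Pooled standard error: sqrt(sum(stderr_i^2 * n_i^2)) / sum(n_i)
    pooled = math.sqrt(
        sum(se**2 * n**2 for se, n in zip(stderrs, sizes))
    ) / sum(sizes)
    return value, pooled
```

For example, subtasks with metrics [0.5, 1.0] and sizes [100, 300] give a weighted group score of 0.875 but an unweighted score of 0.75, which is why the `weight_by_size` choice matters for imbalanced subtask sizes.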