Principle: EvolvingLMMs-Lab lmms-eval Metric Aggregation
| Field | Value |
|---|---|
| Knowledge Sources | |
| Domains | Metrics, Distributed_Computing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
The final step in distributed evaluation consolidates per-task metrics from the gathered results into final scores with statistical uncertainty estimates, turning raw per-document scores into reportable benchmark numbers.
Description
After result gathering brings all per-rank data onto rank 0, the raw per-document scores must be aggregated into dataset-level metrics. This aggregation process has two levels:
Task-level aggregation computes a single metric value for each task by applying an aggregation function (typically mean, but potentially any custom function) to the list of per-document scores. For each metric, a standard error estimate is computed using bootstrap resampling to quantify uncertainty.
Group-level aggregation combines metrics from multiple related tasks (subtasks) into a single group score. For example, a benchmark like MMMU may consist of multiple subject categories, and the group-level score is the (possibly weighted) average across categories. Group-level standard errors are computed using pooled sample standard error formulas.
The aggregation also handles:
- Multiple metrics per task -- A task may report accuracy, BLEU, exact match, and other metrics simultaneously. Each is aggregated independently.
- Filter keys -- Results can be filtered (e.g., by answer extraction strategy), and aggregation is performed per (metric, filter) pair.
- Higher-is-better tracking -- Each metric is annotated with whether higher values indicate better performance, enabling correct comparison and ranking.
- Stability metrics -- When running in k-samples mode, additional metrics like expected accuracy, consensus accuracy, internal variance, and consistency rate are computed and aggregated.
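The per-(metric, filter) bookkeeping described above can be sketched as follows. This is a minimal illustration, not the lmms-eval implementation: the `per_doc` record layout and field names are assumptions for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-document results: each entry carries a metric name,
# a filter key (e.g. an answer-extraction strategy), and a score.
per_doc = [
    {"metric": "acc", "filter": "none", "score": 1.0},
    {"metric": "acc", "filter": "none", "score": 0.0},
    {"metric": "acc", "filter": "regex", "score": 1.0},
    {"metric": "exact_match", "filter": "none", "score": 1.0},
]

# Bucket scores per (metric, filter) pair, then aggregate each
# bucket independently with the aggregation function (here, mean).
buckets = defaultdict(list)
for r in per_doc:
    buckets[(r["metric"], r["filter"])].append(r["score"])

aggregated = {key: mean(scores) for key, scores in buckets.items()}
```

Each (metric, filter) pair thus yields its own dataset-level value, matching the "aggregation is performed per (metric, filter) pair" behavior.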
Usage
Metric aggregation is performed exclusively on rank 0 after result gathering. Other ranks do not participate in this phase and return empty result dictionaries. The consolidated results are used for:
- Displaying evaluation tables in the terminal
- Saving results to JSON files
- Logging to experiment trackers (e.g., Weights & Biases)
- Comparing model performance across benchmarks
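The rank-0-only behavior can be sketched with a simple guard. The function name and signature here are hypothetical, used only to illustrate that non-zero ranks skip aggregation and return empty dictionaries.

```python
from statistics import mean

def consolidate_results(rank, gathered_scores, agg_fn=mean):
    # Only rank 0 performs aggregation after result gathering;
    # every other rank returns an empty result dictionary.
    if rank != 0:
        return {}
    return {task: agg_fn(scores) for task, scores in gathered_scores.items()}
```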
Theoretical Basis
Task-level aggregation applies an aggregation function to per-document scores:
Given scores S = [s_1, s_2, ..., s_N] for a metric m:
Aggregate value:
agg(m) = f(S) where f is typically mean: f(S) = (1/N) * sum(S)
Bootstrap standard error (B iterations):
For b = 1 to B:
S_b = sample with replacement from S (size N)
agg_b = f(S_b)
stderr(m) = std([agg_1, agg_2, ..., agg_B])
The default number of bootstrap iterations is 100,000, providing precise uncertainty estimates. For computationally expensive metrics (BLEU, chrF, TER), the number is reduced to 100.
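The bootstrap procedure above can be written directly in Python. This is a self-contained sketch (the function name is not from lmms-eval); the iteration count is kept small here, whereas the text notes defaults of 100,000 or 100 depending on metric cost.

```python
import random
from statistics import mean, stdev

def bootstrap_stderr(scores, agg_fn=mean, iters=1000, seed=0):
    # Resample the score list with replacement, re-aggregate each
    # resample, and report the standard deviation of the bootstrap
    # distribution as the standard error estimate.
    rng = random.Random(seed)
    n = len(scores)
    estimates = [
        agg_fn([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(iters)
    ]
    return stdev(estimates)
```

For 100 binary scores with mean 0.5, the analytic standard error is sqrt(0.5 * 0.5 / 100) = 0.05, and the bootstrap estimate lands close to that value.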
Group-level aggregation uses either simple or weighted averaging:
Given subtask metrics M = [m_1, m_2, ..., m_K] with sample sizes N = [n_1, ..., n_K]:
Unweighted mean:
group_metric = (1/K) * sum(M)
Weighted mean (by sample size):
group_metric = sum(m_i * n_i) / sum(n_i)
Pooled standard error:
group_stderr = sqrt(sum(stderr_i^2 * n_i^2)) / sum(n_i)
The choice between weighted and unweighted averaging is controlled by the group's configuration (weight_by_size parameter).
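The group-level formulas can be combined into one small helper. This is an illustrative sketch mirroring the equations above (the function name is an assumption), with `weight_by_size` switching between the unweighted and sample-size-weighted means.

```python
import math

def group_aggregate(metrics, stderrs, sizes, weight_by_size=True):
    # Combine K subtask metrics into a single group score.
    if weight_by_size:
        # Weighted mean: sum(m_i * n_i) / sum(n_i)
        value = sum(m * n for m, n in zip(metrics, sizes)) / sum(sizes)
    else:
        # Unweighted mean: (1/K) * sum(M)
        value = sum(metrics) / len(metrics)
    # Pooled standard error: sqrt(sum(stderr_i^2 * n_i^2)) / sum(n_i)
    pooled = math.sqrt(
        sum(se**2 * n**2 for se, n in zip(stderrs, sizes))
    ) / sum(sizes)
    return value, pooled
```

For example, subtasks with metrics [0.5, 1.0] and sizes [100, 300] give a weighted group score of 0.875 but an unweighted score of 0.75, which is why the `weight_by_size` choice matters for imbalanced subtask sizes.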