Implementation: EvolvingLMMs-Lab/lmms-eval Consolidate Results
| Knowledge Sources | |
|---|---|
| Domains | Metrics, Distributed_Computing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A concrete tool in the lmms-eval framework that consolidates per-task metrics from gathered results, providing group-level aggregation and bootstrap standard error estimation.
Description
The metric consolidation logic resides in lmms_eval/evaluator_utils.py and consists of three main components:
1. calculate_aggregate_metric (L115-132) -- A method on TaskOutput that computes the aggregated metric value and bootstrap standard error for each (metric, filter_key) pair in a task. It calls the task's configured aggregation function (e.g., mean, custom function) on the list of per-document scores, then computes bootstrap standard error by resampling. Results are stored in task_output.agg_metrics.
2. consolidate_results (L382-455) -- Iterates over all TaskOutput objects and assembles six dictionaries: results (aggregated metric values), samples (logged per-document data), configs (task YAML configurations), versions (task version numbers), num_fewshot (few-shot counts), and higher_is_better (metric direction indicators). This function collects all per-task information into a unified structure suitable for display and serialization.
3. consolidate_group_results (L458-567) -- Recursively traverses the task hierarchy to compute group-level metrics. For each group with an aggregate_metric_list configuration, it gathers the metrics and standard errors from all subtasks, applies the configured aggregation function (typically aggregate_subtask_metrics for mean), and computes pooled standard errors using pooled_sample_stderr. The function supports nested groups and weighted/unweighted averaging.
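To make the group-level arithmetic concrete, here is a minimal sketch of size-weighted averaging with a pooled standard error. The helper names are hypothetical stand-ins for the roles of aggregate_subtask_metrics and pooled_sample_stderr; the exact formulas used by lmms-eval should be checked against the source.

```python
import math

def weighted_mean(values, sizes):
    # Size-weighted mean of the subtask metric values.
    return sum(v * n for v, n in zip(values, sizes)) / sum(sizes)

def pooled_stderr(stderrs, sizes):
    # Recover each subtask's sample variance from its standard error
    # (var_i ~= n_i * stderr_i^2), pool the variances across subtasks,
    # then convert back to a standard error for the combined sample.
    pooled_var = sum(
        (n - 1) * (se ** 2) * n for se, n in zip(stderrs, sizes)
    ) / (sum(sizes) - len(sizes))
    return math.sqrt(pooled_var / sum(sizes))

# Two subtasks: 100 samples scoring 0.40 and 200 samples scoring 0.55
group_metric = weighted_mean([0.40, 0.55], [100, 200])   # ~0.5
group_stderr = pooled_stderr([0.049, 0.035], [100, 200])
```

Unweighted averaging (every subtask counting equally regardless of size) corresponds to passing equal weights instead of sample counts.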
All three components run exclusively on rank 0. Other ranks return empty dictionaries and do not participate in aggregation.
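A hedged sketch of that gating pattern follows; the rank lookup via the RANK environment variable and the wrapper function are assumptions for illustration, not the framework's actual wiring.

```python
import os

def get_rank() -> int:
    # Assumption: rank comes from the RANK environment variable, as set by
    # torchrun-style launchers; single-process runs default to rank 0.
    return int(os.environ.get("RANK", "0"))

def consolidate_on_rank0(eval_tasks, consolidate_fn):
    # Only rank 0 aggregates; every other rank returns empty dictionaries
    # with the same 6-tuple shape so callers can unpack uniformly.
    if get_rank() != 0:
        return {}, {}, {}, {}, {}, {}
    return consolidate_fn(eval_tasks)
```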
Usage
These functions are called at the end of the evaluation pipeline, after result gathering:
```python
# On rank 0 only:
for task_output in eval_tasks:
    task_output.calculate_aggregate_metric(bootstrap_iters=bootstrap_iters)
    task_output.calculate_clt_aggregate_metric()

results, samples, configs, versions, num_fewshot, higher_is_better = consolidate_results(eval_tasks)
results, versions, show_group_table, *_ = consolidate_group_results(
    results, versions, task_dict, ...
)
```
Code Reference
Source Location
- Repository: lmms-eval
- File: lmms_eval/evaluator_utils.py
- Lines: L115-132 (calculate_aggregate_metric), L382-455 (consolidate_results), L458-567 (consolidate_group_results)
Signature
```python
# TaskOutput method
def calculate_aggregate_metric(self, bootstrap_iters: int = 100000) -> None:
    ...

# Module-level function
def consolidate_results(
    eval_tasks: List[TaskOutput],
) -> Tuple[dict, dict, dict, dict, dict, dict]:
    ...

# Module-level function
def consolidate_group_results(
    results: dict,
    versions: dict,
    task_dict: dict,
    task_root: Optional[str] = None,
    show_group_table: bool = False,
    task_aggregation_list: Optional[dict] = None,
) -> Tuple[dict, dict, bool, Union[None, dict]]:
    ...
```
Import
```python
from lmms_eval.evaluator_utils import (
    consolidate_results,
    consolidate_group_results,
    TaskOutput,
)
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| eval_tasks | List[TaskOutput] | Yes | List of TaskOutput objects containing gathered per-task metrics, samples, and configuration |
| bootstrap_iters | int | No (default: 100000) | Number of bootstrap iterations for standard error estimation; set to 0 to disable |
| task_dict | dict | Yes (for group results) | Hierarchical dictionary mapping group/task names to Task objects or nested group dicts |
| task_root | Optional[str] | No (default: None) | The parent group name during recursive traversal |
| show_group_table | bool | No (default: False) | Whether any group requires a group-level results table |
| task_aggregation_list | Optional[dict] | No (default: None) | Accumulator mapping group names to lists of subtask names |
Outputs
| Name | Type | Description |
|---|---|---|
| results | defaultdict(dict) | Maps task/group names to dicts of "metric,filter_key" -> aggregated_value pairs, plus "alias" and "samples" |
| samples | defaultdict(list) | Maps task names to lists of logged sample dicts (model outputs, ground truth, metadata) |
| configs | defaultdict(dict) | Maps task names to their YAML configuration dictionaries |
| versions | defaultdict(dict) | Maps task names to their version identifiers |
| num_fewshot | defaultdict(int) | Maps task names to the number of few-shot examples used |
| higher_is_better | defaultdict(dict) | Maps task names to dicts indicating metric direction (True = higher is better) |
| show_group_table | bool | Whether any group-level aggregation table should be displayed |
| task_aggregation_list | dict | Maps group names to lists of their constituent subtask names |
Usage Examples
Basic Example
```python
from lmms_eval.evaluator_utils import consolidate_results, consolidate_group_results

# After gathering results on rank 0
RANK = 0
if RANK == 0:
    # Step 1: Compute per-task aggregate metrics
    for task_output in eval_tasks:
        task_output.calculate_aggregate_metric(bootstrap_iters=100000)

    # Step 2: Consolidate all task results
    (
        results,
        samples,
        configs,
        versions,
        num_fewshot,
        higher_is_better,
    ) = consolidate_results(eval_tasks)

    # Step 3: Compute group-level aggregations
    results, versions, show_group_table, _ = consolidate_group_results(
        results,
        versions,
        task_dict,
    )

    # results now contains both per-task and per-group metrics
    # Example: results["mmmu"]["exact_match,none"] = 0.45
    #          results["mmmu"]["exact_match_stderr,none"] = 0.012
```
Bootstrap Standard Error Calculation
```python
# Inside calculate_aggregate_metric:
for (metric, filter_key), items in self.sample_metrics.items():
    agg_fn = self.task.aggregation()[metric]
    metric_key = f"{metric},{filter_key}"

    # Compute aggregate value
    self.agg_metrics[metric_key] = agg_fn(items)
    self.sample_len = len(items)

    # Compute bootstrap standard error
    stderr_fn = stderr_for_metric(
        metric=agg_fn,
        bootstrap_iters=bootstrap_iters,
    )
    self.agg_metrics[f"{metric}_stderr,{filter_key}"] = (
        stderr_fn(items) if (stderr_fn and len(items) > 1) else "N/A"
    )
```