
Implementation:EvolvingLMMs Lab Lmms eval Consolidate Results

From Leeroopedia
Domains Metrics, Distributed_Computing
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool in the lmms-eval framework for consolidating per-task metrics from gathered results, with group-level aggregation and bootstrap standard error estimation.

Description

The metric consolidation logic resides in lmms_eval/evaluator_utils.py and consists of three main components:

1. calculate_aggregate_metric (L115-132) -- A method on TaskOutput that computes the aggregated metric value and bootstrap standard error for each (metric, filter_key) pair in a task. It calls the task's configured aggregation function (e.g., mean, custom function) on the list of per-document scores, then computes bootstrap standard error by resampling. Results are stored in task_output.agg_metrics.

2. consolidate_results (L382-455) -- Iterates over all TaskOutput objects and assembles six dictionaries: results (aggregated metric values), samples (logged per-document data), configs (task YAML configurations), versions (task version numbers), num_fewshot (few-shot counts), and higher_is_better (metric direction indicators). This function collects all per-task information into a unified structure suitable for display and serialization.

3. consolidate_group_results (L458-567) -- Recursively traverses the task hierarchy to compute group-level metrics. For each group with an aggregate_metric_list configuration, it gathers the metrics and standard errors from all subtasks, applies the configured aggregation function (typically aggregate_subtask_metrics for mean), and computes pooled standard errors using pooled_sample_stderr. The function supports nested groups and weighted/unweighted averaging.
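The group-level pooling can be illustrated with a minimal sketch. The size-weighted mean and the independence-based pooled standard error below are simplifications; the exact formulas used by aggregate_subtask_metrics and pooled_sample_stderr in lmms-eval may differ, and the function name weighted_group_metric is ours:

```python
import math

def weighted_group_metric(metrics, stderrs, sizes):
    """Size-weighted mean of subtask metrics, with a pooled standard error
    assuming independent subtask estimates (a simplification)."""
    total = sum(sizes)
    weights = [n / total for n in sizes]
    # Weighted average of the subtask metric values
    group_mean = sum(w * m for w, m in zip(weights, metrics))
    # Variance of a weighted sum of independent estimates
    group_se = math.sqrt(sum((w * se) ** 2 for w, se in zip(weights, stderrs)))
    return group_mean, group_se

# Two equally sized subtasks: the group mean is exactly 0.5
m, se = weighted_group_metric([0.4, 0.6], [0.02, 0.03], [100, 100])
```

Unweighted averaging corresponds to passing equal sizes for every subtask.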

All three components run exclusively on rank 0. Other ranks return empty dictionaries and do not participate in aggregation.
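The rank gating amounts to an early return of empty structures on non-zero ranks. A rough sketch (consolidate_on_rank and the plain-dict task records are hypothetical; in lmms-eval the gating is performed by the caller in the evaluator):

```python
def consolidate_on_rank(rank, eval_tasks):
    """Illustrative rank gating: only rank 0 aggregates, others return empty dicts."""
    if rank != 0:
        return {}, {}
    results, versions = {}, {}
    for task in eval_tasks:
        results[task["name"]] = task["metrics"]
        versions[task["name"]] = task["version"]
    return results, versions

tasks = [{"name": "mmmu", "metrics": {"exact_match,none": 0.45}, "version": 0}]
rank0_results, _ = consolidate_on_rank(0, tasks)   # populated
rank1_results, _ = consolidate_on_rank(1, tasks)   # empty
```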

Usage

These functions are called at the end of the evaluation pipeline, after result gathering:

# On rank 0 only:
for task_output in eval_tasks:
    task_output.calculate_aggregate_metric(bootstrap_iters=bootstrap_iters)
    task_output.calculate_clt_aggregate_metric()

results, samples, configs, versions, num_fewshot, higher_is_better = consolidate_results(eval_tasks)

results, versions, show_group_table, *_ = consolidate_group_results(
    results, versions, task_dict, ...
)

Code Reference

Source Location

  • Repository: lmms-eval
  • File: lmms_eval/evaluator_utils.py
  • Lines: L115-132 (calculate_aggregate_metric), L382-455 (consolidate_results), L458-567 (consolidate_group_results)

Signature

# TaskOutput method
def calculate_aggregate_metric(self, bootstrap_iters: int = 100000) -> None:
    ...

# Module-level function
def consolidate_results(
    eval_tasks: List[TaskOutput],
) -> Tuple[dict, dict, dict, dict, dict, dict]:
    ...

# Module-level function
def consolidate_group_results(
    results: dict,
    versions: dict,
    task_dict: dict,
    task_root: Optional[str] = None,
    show_group_table: bool = False,
    task_aggregation_list: Optional[dict] = None,
) -> Tuple[dict, dict, bool, Union[None, dict]]:
    ...

Import

from lmms_eval.evaluator_utils import (
    consolidate_results,
    consolidate_group_results,
    TaskOutput,
)

I/O Contract

Inputs

  • eval_tasks (List[TaskOutput], required) -- List of TaskOutput objects containing gathered per-task metrics, samples, and configuration
  • bootstrap_iters (int, default: 100000) -- Number of bootstrap iterations for standard error estimation; set to 0 to disable
  • task_dict (dict, required for group results) -- Hierarchical dictionary mapping group/task names to Task objects or nested group dicts
  • task_root (Optional[str], default: None) -- The parent group name during recursive traversal
  • show_group_table (bool, default: False) -- Whether any group requires a group-level results table
  • task_aggregation_list (Optional[dict], default: None) -- Accumulator mapping group names to lists of subtask names

Outputs

  • results (defaultdict(dict)) -- Maps task/group names to dicts of "metric,filter_key" -> aggregated_value pairs, plus "alias" and "samples"
  • samples (defaultdict(list)) -- Maps task names to lists of logged sample dicts (model outputs, ground truth, metadata)
  • configs (defaultdict(dict)) -- Maps task names to their YAML configuration dictionaries
  • versions (defaultdict(dict)) -- Maps task names to their version identifiers
  • num_fewshot (defaultdict(int)) -- Maps task names to the number of few-shot examples used
  • higher_is_better (defaultdict(dict)) -- Maps task names to dicts indicating metric direction (True = higher is better)
  • show_group_table (bool) -- Whether any group-level aggregation table should be displayed
  • task_aggregation_list (dict) -- Maps group names to lists of their constituent subtask names
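A minimal sketch of how the six consolidate_results outputs are assembled. TaskOutputStub and its field names are hypothetical stand-ins for the real TaskOutput attributes, which may be named differently:

```python
from collections import defaultdict

class TaskOutputStub:
    """Hypothetical stand-in for lmms_eval's TaskOutput, for illustration only."""
    def __init__(self, task_name, agg_metrics, logged_samples,
                 task_config, version, n_shot, higher_is_better):
        self.task_name = task_name
        self.agg_metrics = agg_metrics
        self.logged_samples = logged_samples
        self.task_config = task_config
        self.version = version
        self.n_shot = n_shot
        self.higher_is_better = higher_is_better

def consolidate_sketch(eval_tasks):
    """Collect per-task information into the six output dictionaries."""
    results = defaultdict(dict)
    samples = defaultdict(list)
    configs = defaultdict(dict)
    versions = defaultdict(dict)
    num_fewshot = defaultdict(int)
    higher_is_better = defaultdict(dict)
    for t in eval_tasks:
        results[t.task_name].update(t.agg_metrics)
        samples[t.task_name].extend(t.logged_samples)
        configs[t.task_name] = t.task_config
        versions[t.task_name] = t.version
        num_fewshot[t.task_name] = t.n_shot
        higher_is_better[t.task_name] = t.higher_is_better
    return results, samples, configs, versions, num_fewshot, higher_is_better

task = TaskOutputStub(
    "mmmu", {"exact_match,none": 0.45}, [{"doc_id": 0, "target": "B"}],
    {"dataset_path": "MMMU/MMMU"}, 0, 0, {"exact_match": True},
)
results, samples, configs, versions, num_fewshot, hib = consolidate_sketch([task])
```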

Usage Examples

Basic Example

from lmms_eval.evaluator_utils import consolidate_results, consolidate_group_results

# After gathering results on rank 0
RANK = 0

if RANK == 0:
    # Step 1: Compute per-task aggregate metrics
    for task_output in eval_tasks:
        task_output.calculate_aggregate_metric(bootstrap_iters=100000)

    # Step 2: Consolidate all task results
    (
        results,
        samples,
        configs,
        versions,
        num_fewshot,
        higher_is_better,
    ) = consolidate_results(eval_tasks)

    # Step 3: Compute group-level aggregations
    results, versions, show_group_table, _ = consolidate_group_results(
        results,
        versions,
        task_dict,
    )

    # results now contains both per-task and per-group metrics
    # Example: results["mmmu"]["exact_match,none"] = 0.45
    #          results["mmmu"]["exact_match_stderr,none"] = 0.012

Bootstrap Standard Error Calculation

# Inside calculate_aggregate_metric:
for (metric, filter_key), items in self.sample_metrics.items():
    agg_fn = self.task.aggregation()[metric]
    metric_key = f"{metric},{filter_key}"

    # Compute aggregate value
    self.agg_metrics[metric_key] = agg_fn(items)
    self.sample_len = len(items)

    # Compute bootstrap standard error
    stderr_fn = stderr_for_metric(
        metric=agg_fn,
        bootstrap_iters=bootstrap_iters,
    )
    self.agg_metrics[f"{metric}_stderr,{filter_key}"] = (
        stderr_fn(items) if (stderr_fn and len(items) > 1) else "N/A"
    )
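As an illustration of what stderr_for_metric computes, here is a minimal, self-contained bootstrap standard-error sketch. The function name bootstrap_stderr and the fixed seed are ours; the framework's implementation is more elaborate (e.g. it may parallelize resampling) and may differ in detail:

```python
import random
from statistics import mean, stdev

def bootstrap_stderr(scores, agg_fn=mean, bootstrap_iters=1000, seed=0):
    """Standard error of agg_fn(scores), estimated by resampling with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    # Re-aggregate many resampled score lists; the spread of those
    # estimates approximates the sampling error of the aggregate.
    estimates = [
        agg_fn([rng.choice(scores) for _ in range(n)])
        for _ in range(bootstrap_iters)
    ]
    return stdev(estimates)

# Per-document 0/1 scores, e.g. exact-match outcomes
scores = [1, 0, 1, 1, 0, 1, 0, 1]
se = bootstrap_stderr(scores)
```

With only a single score, resampling is degenerate, which is why the snippet above falls back to "N/A" when len(items) <= 1.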

Related Pages

Implements Principle
