Implementation: EvolvingLMMs-Lab/lmms-eval Consolidate Results
| Knowledge Sources | |
|---|---|
| Domains | Metrics, Distributed_Computing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A concrete tool in the lmms-eval framework that consolidates per-task metrics from gathered results, providing group-level aggregation and bootstrap standard error estimation.
Description
The metric consolidation logic resides in lmms_eval/evaluator_utils.py and consists of three main components:
1. calculate_aggregate_metric (L115-132) -- A method on TaskOutput that computes the aggregated metric value and bootstrap standard error for each (metric, filter_key) pair in a task. It calls the task's configured aggregation function (e.g., mean, custom function) on the list of per-document scores, then computes bootstrap standard error by resampling. Results are stored in task_output.agg_metrics.
2. consolidate_results (L382-455) -- Iterates over all TaskOutput objects and assembles six dictionaries: results (aggregated metric values), samples (logged per-document data), configs (task YAML configurations), versions (task version numbers), num_fewshot (few-shot counts), and higher_is_better (metric direction indicators). This function collects all per-task information into a unified structure suitable for display and serialization.
3. consolidate_group_results (L458-567) -- Recursively traverses the task hierarchy to compute group-level metrics. For each group with an aggregate_metric_list configuration, it gathers the metrics and standard errors from all subtasks, applies the configured aggregation function (typically aggregate_subtask_metrics for mean), and computes pooled standard errors using pooled_sample_stderr. The function supports nested groups and weighted/unweighted averaging.
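To make the group-level arithmetic concrete, here is a minimal sketch of size-weighted averaging with a pooled standard error. The helper names are hypothetical stand-ins for the roles of aggregate_subtask_metrics and pooled_sample_stderr; the exact formulas used by lmms-eval should be checked against the source.

```python
import math

def weighted_mean(values, sizes):
    # Size-weighted mean of the subtask metric values.
    return sum(v * n for v, n in zip(values, sizes)) / sum(sizes)

def pooled_stderr(stderrs, sizes):
    # Recover each subtask's sample variance from its standard error
    # (var_i ~= n_i * stderr_i^2), pool the variances across subtasks,
    # then convert back to a standard error for the combined sample.
    pooled_var = sum(
        (n - 1) * (se ** 2) * n for se, n in zip(stderrs, sizes)
    ) / (sum(sizes) - len(sizes))
    return math.sqrt(pooled_var / sum(sizes))

# Two subtasks: 100 samples scoring 0.40 and 200 samples scoring 0.55
group_metric = weighted_mean([0.40, 0.55], [100, 200])   # ~0.5
group_stderr = pooled_stderr([0.049, 0.035], [100, 200])
```

Unweighted averaging (every subtask counting equally regardless of size) corresponds to passing equal weights instead of sample counts.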
All three components run exclusively on rank 0. Other ranks return empty dictionaries and do not participate in aggregation.
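A hedged sketch of that gating pattern follows; the rank lookup via the RANK environment variable and the wrapper function are assumptions for illustration, not the framework's actual wiring.

```python
import os

def get_rank() -> int:
    # Assumption: rank comes from the RANK environment variable, as set by
    # torchrun-style launchers; single-process runs default to rank 0.
    return int(os.environ.get("RANK", "0"))

def consolidate_on_rank0(eval_tasks, consolidate_fn):
    # Only rank 0 aggregates; every other rank returns empty dictionaries
    # with the same 6-tuple shape so callers can unpack uniformly.
    if get_rank() != 0:
        return {}, {}, {}, {}, {}, {}
    return consolidate_fn(eval_tasks)
```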
Usage
These functions are called at the end of the evaluation pipeline, after result gathering:
```python
# On rank 0 only:
for task_output in eval_tasks:
    task_output.calculate_aggregate_metric(bootstrap_iters=bootstrap_iters)
    task_output.calculate_clt_aggregate_metric()

results, samples, configs, versions, num_fewshot, higher_is_better = consolidate_results(eval_tasks)
results, versions, show_group_table, *_ = consolidate_group_results(
    results, versions, task_dict, ...
)
```
Code Reference
Source Location
- Repository: lmms-eval
- File: lmms_eval/evaluator_utils.py
- Lines: L115-132 (calculate_aggregate_metric), L382-455 (consolidate_results), L458-567 (consolidate_group_results)
Signature
```python
# TaskOutput method
def calculate_aggregate_metric(self, bootstrap_iters: int = 100000) -> None:
    ...

# Module-level function
def consolidate_results(
    eval_tasks: List[TaskOutput],
) -> Tuple[dict, dict, dict, dict, dict, dict]:
    ...

# Module-level function
def consolidate_group_results(
    results: dict,
    versions: dict,
    task_dict: dict,
    task_root: Optional[str] = None,
    show_group_table: bool = False,
    task_aggregation_list: Optional[dict] = None,
) -> Tuple[dict, dict, bool, Union[None, dict]]:
    ...
```
Import
```python
from lmms_eval.evaluator_utils import (
    consolidate_results,
    consolidate_group_results,
    TaskOutput,
)
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| eval_tasks | List[TaskOutput] | Yes | List of TaskOutput objects containing gathered per-task metrics, samples, and configuration |
| bootstrap_iters | int | No (default: 100000) | Number of bootstrap iterations for standard error estimation; set to 0 to disable |
| task_dict | dict | Yes (for group results) | Hierarchical dictionary mapping group/task names to Task objects or nested group dicts |
| task_root | Optional[str] | No (default: None) | The parent group name during recursive traversal |
| show_group_table | bool | No (default: False) | Whether any group requires a group-level results table |
| task_aggregation_list | Optional[dict] | No (default: None) | Accumulator mapping group names to lists of subtask names |
Outputs
| Name | Type | Description |
|---|---|---|
| results | defaultdict(dict) | Maps task/group names to dicts of "metric,filter_key" -> aggregated_value pairs, plus "alias" and "samples" |
| samples | defaultdict(list) | Maps task names to lists of logged sample dicts (model outputs, ground truth, metadata) |
| configs | defaultdict(dict) | Maps task names to their YAML configuration dictionaries |
| versions | defaultdict(dict) | Maps task names to their version identifiers |
| num_fewshot | defaultdict(int) | Maps task names to the number of few-shot examples used |
| higher_is_better | defaultdict(dict) | Maps task names to dicts indicating metric direction (True = higher is better) |
| show_group_table | bool | Whether any group-level aggregation table should be displayed |
| task_aggregation_list | dict | Maps group names to lists of their constituent subtask names |
Usage Examples
Basic Example
```python
from lmms_eval.evaluator_utils import consolidate_results, consolidate_group_results

# After gathering results on rank 0
RANK = 0
if RANK == 0:
    # Step 1: Compute per-task aggregate metrics
    for task_output in eval_tasks:
        task_output.calculate_aggregate_metric(bootstrap_iters=100000)

    # Step 2: Consolidate all task results
    (
        results,
        samples,
        configs,
        versions,
        num_fewshot,
        higher_is_better,
    ) = consolidate_results(eval_tasks)

    # Step 3: Compute group-level aggregations
    results, versions, show_group_table, _ = consolidate_group_results(
        results,
        versions,
        task_dict,
    )

    # results now contains both per-task and per-group metrics
    # Example: results["mmmu"]["exact_match,none"] = 0.45
    #          results["mmmu"]["exact_match_stderr,none"] = 0.012
```
Bootstrap Standard Error Calculation
```python
# Inside calculate_aggregate_metric:
for (metric, filter_key), items in self.sample_metrics.items():
    agg_fn = self.task.aggregation()[metric]
    metric_key = f"{metric},{filter_key}"

    # Compute aggregate value
    self.agg_metrics[metric_key] = agg_fn(items)
    self.sample_len = len(items)

    # Compute bootstrap standard error
    stderr_fn = stderr_for_metric(
        metric=agg_fn,
        bootstrap_iters=bootstrap_iters,
    )
    self.agg_metrics[f"{metric}_stderr,{filter_key}"] = (
        stderr_fn(items) if (stderr_fn and len(items) > 1) else "N/A"
    )
```