Principle:EvolvingLMMs Lab Lmms eval Result Gathering
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Data_Processing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
When a dataset is sharded across multiple processes, the per-rank evaluation results must be collected on a single coordinator rank to reconstruct the complete evaluation output.
Description
After parallel inference completes, each rank holds evaluation results for only its own shard of the data. To compute dataset-level metrics, all per-rank results must be collected onto a single rank (conventionally rank 0, the "coordinator"). This process is known as result gathering.
Result gathering involves three distinct types of data:
- Logged samples -- Detailed per-document information including model inputs, outputs, and reference answers. These are used for qualitative analysis and debugging.
- Per-metric score lists -- Numerical scores for each document under each metric (e.g., accuracy, BLEU score). These are the raw data from which aggregate metrics are computed.
- Per-sample stability metrics -- When running multiple evaluation passes (k-samples mode), per-sample consistency and variance metrics are collected for statistical analysis.
Each of these is gathered separately using a gather-to-root collective operation, where all ranks send their local data to rank 0. On rank 0, the gathered data from all ranks is flattened into a single list, reconstructing the complete dataset's results.
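The flattening step on rank 0 can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the nested list `gathered_scores` is hand-built to stand in for what rank 0 holds after a gather-to-root, and `flatten_gathered` is a hypothetical helper name. In a PyTorch setup the nested list might come from a collective such as `torch.distributed.gather_object`, but that is an assumption rather than something stated above.

```python
from itertools import chain


def flatten_gathered(gathered):
    """Concatenate per-rank result lists in rank order.

    `gathered` stands in for what rank 0 holds after a gather-to-root:
    one list per rank, indexed by rank.
    """
    return list(chain.from_iterable(gathered))


# Hypothetical per-rank score lists for a world size of 3.
gathered_scores = [
    [1.0, 0.0],       # rank 0's shard
    [1.0],            # rank 1's shard
    [0.0, 1.0, 1.0],  # rank 2's shard
]
all_scores = flatten_gathered(gathered_scores)
```

The flattened list follows rank order, which may differ from the original document order depending on how the dataset was sharded.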
Usage
Result gathering is required whenever evaluation is distributed across multiple processes (ranks), typically one per GPU. It occurs:
- After all inference is complete and post-processing (filtering) has been applied
- Before metric aggregation, which requires the complete dataset's results
On ranks other than 0, the gathered data is not needed, and those ranks simply wait at a barrier until rank 0 has finished processing.
Theoretical Basis
The gather operation follows a tree or direct pattern depending on the backend implementation:
Gather-to-root (rank 0):
    Each rank r sends data D_r to rank 0
    Rank 0 receives [D_0, D_1, ..., D_{W-1}]
    Rank 0 flattens: D_all = D_0 + D_1 + ... + D_{W-1}
Communication cost:
- Data volume: sum(|D_r|) for r in {0, ..., W-1}
- Latency: O(W) for direct gather, O(log W) for tree-based
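As a worked example of the cost model above, the following sketch computes the total data volume and an idealized count of sequential communication steps for both patterns. The function name `gather_cost` and the shard sizes are hypothetical, and the step counts assume a simple model: the root receives from one rank at a time in a direct gather, and a binomial tree halves the number of unmerged ranks each round.

```python
def gather_cost(shard_sizes):
    """Return (total_volume, direct_steps, tree_steps) for a gather-to-root.

    shard_sizes[r] is |D_r|, the size of rank r's local data, in any unit.
    """
    world_size = len(shard_sizes)
    total_volume = sum(shard_sizes)
    # Direct gather: the root receives from each other rank in turn.
    direct_steps = max(world_size - 1, 0)
    # Binomial tree: the number of ranks holding unmerged data halves each
    # round, so ceil(log2(W)) rounds are needed.
    tree_steps = (world_size - 1).bit_length() if world_size > 0 else 0
    return total_volume, direct_steps, tree_steps


# Hypothetical shard sizes (e.g. number of scores) for 8 ranks.
volume, direct, tree = gather_cost([120, 115, 130, 118, 122, 125, 119, 121])
```

For these 8 ranks the volume is 970 units, with 7 sequential steps for a direct gather and 3 rounds for a tree. Real backends may overlap transfers, so these counts are best read as an upper-level model rather than measured latency.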
Each task gathers three kinds of data; the score and stability gathers repeat once per (metric, filter) pair:
For each task T:
    1. gather(logged_samples_r) -> logged_samples_all (on rank 0)
    2. For each (metric, filter) in T.sample_metrics:
           gather(scores_r) -> scores_all (on rank 0)
    3. For each (metric, filter) in T.per_sample_metrics:
           gather(stability_r) -> stability_all (on rank 0)
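The per-task loop above can be sketched as a single-process simulation. Everything here is an assumption made for illustration: the function `gather_task`, the dict layout of each rank's state, and the example metric names are hypothetical, and the inner `gather` helper only imitates a gather-to-root followed by flattening.

```python
from itertools import chain


def gather_task(per_rank):
    """Simulate the per-task gathers and return what rank 0 would hold.

    per_rank[r] is a hypothetical dict with keys "logged_samples",
    "sample_metrics" and "per_sample_metrics"; the two metric dicts are
    keyed by (metric, filter) pairs.
    """
    def gather(select):
        # Stand-in for a gather-to-root followed by flattening on rank 0.
        return list(chain.from_iterable(select(state) for state in per_rank))

    result = {"logged_samples": gather(lambda s: s["logged_samples"])}
    for field in ("sample_metrics", "per_sample_metrics"):
        # Every rank must walk the keys in the same order so the collectives
        # line up; sorting makes that order deterministic.
        keys = sorted(per_rank[0][field])
        result[field] = {
            key: gather(lambda s, k=key, f=field: s[f][k]) for key in keys
        }
    return result


rank0 = {
    "logged_samples": [{"doc_id": 0}],
    "sample_metrics": {("acc", "none"): [1.0]},
    "per_sample_metrics": {("consistency", "none"): [0.5]},
}
rank1 = {
    "logged_samples": [{"doc_id": 1}],
    "sample_metrics": {("acc", "none"): [0.0]},
    "per_sample_metrics": {("consistency", "none"): [1.0]},
}
task_results = gather_task([rank0, rank1])
```

In a real distributed run each call to `gather` would be a collective that all ranks enter, which is why a consistent key order across ranks matters.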
After all gathers complete, a barrier ensures every rank is synchronized before rank 0 proceeds to metric aggregation. This prevents rank 0 from starting aggregation while other ranks are still in the gather phase.
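The gather-then-barrier ordering can be imitated with threads standing in for ranks. This is a toy simulation under stated assumptions: `shards`, `gathered`, and `run_rank` are hypothetical names, a shared list replaces real inter-process communication, and `threading.Barrier` from the standard library plays the role of the distributed barrier.

```python
import threading

WORLD_SIZE = 4
# Hypothetical shards: rank r holds the scores for its slice of the dataset.
shards = [[1.0, 0.0], [1.0, 1.0], [0.0], [1.0, 0.0, 1.0]]

gathered = [None] * WORLD_SIZE  # rank 0's receive buffer, one slot per rank
barrier = threading.Barrier(WORLD_SIZE)
aggregated = {}


def run_rank(rank):
    # "Gather": each rank deposits its local data in its own slot.
    gathered[rank] = shards[rank]
    # No rank passes this point until every rank has contributed.
    barrier.wait()
    if rank == 0:
        # Only the coordinator flattens and aggregates.
        scores = [s for part in gathered for s in part]
        aggregated["mean"] = sum(scores) / len(scores)


threads = [threading.Thread(target=run_rank, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In this simulation the barrier is what guarantees that every slot is filled before rank 0 reads the buffer; with a real collective gather, the gather call itself also blocks the root until the data has arrived.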
The total data transferred during gathering scales linearly with dataset size and the number of metrics, but is typically small compared to the inference data volume since only scores and metadata are gathered (not model weights or activations).