Principle:EvolvingLMMs Lab Lmms eval Result Gathering

Knowledge Sources	lmms-eval
Domains	Distributed_Computing, Data_Processing
Last Updated	2026-02-14 00:00 GMT

Overview

When a dataset is distributed across multiple processes, the per-rank evaluation results must be collected on a single coordinator rank to reconstruct the complete evaluation output.

Description

After parallel inference completes, each rank holds evaluation results for only its own shard of the data. To compute dataset-level metrics, all per-rank results must be collected onto a single rank (conventionally rank 0, the "coordinator"). This process is known as result gathering.

Result gathering involves three distinct types of data:

Logged samples -- Detailed per-document information including model inputs, outputs, and reference answers. These are used for qualitative analysis and debugging.
Per-metric score lists -- Numerical scores for each document under each metric (e.g., accuracy, BLEU score). These are the raw data from which aggregate metrics are computed.
Per-sample stability metrics -- When running multiple evaluation passes (k-samples mode), per-sample consistency and variance metrics are collected for statistical analysis.

Each of these is gathered separately using a gather-to-root collective operation, where all ranks send their local data to rank 0. On rank 0, the gathered data from all ranks is flattened into a single list, reconstructing the complete dataset's results.
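
The gather-and-flatten step can be sketched without a distributed runtime. The snippet below is a minimal illustration, not the lmms-eval implementation: `simulate_gather_to_root` is a hypothetical helper standing in for the real collective, and the per-rank score values are made up.

```python
from itertools import chain


def simulate_gather_to_root(per_rank_data, root=0):
    """Hypothetical stand-in for a gather-to-root collective.

    Returns what each rank would hold afterwards: the root gets a list
    with one entry per rank, every other rank gets None.
    """
    world_size = len(per_rank_data)
    return [list(per_rank_data) if rank == root else None
            for rank in range(world_size)]


# Illustrative shards: each rank holds results for its own documents only.
per_rank_scores = [
    [1.0, 0.0],  # rank 0
    [1.0],       # rank 1
    [0.0, 1.0],  # rank 2
]

after_gather = simulate_gather_to_root(per_rank_scores)
gathered_on_root = after_gather[0]  # one list per rank, in rank order

# Flatten the list of per-rank lists into one list covering the whole dataset.
all_scores = list(chain.from_iterable(gathered_on_root))
```

Note that the flattened list is ordered by rank and then by position within each shard, which is not necessarily the original document order.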

Usage

Result gathering is required whenever evaluation is distributed across multiple GPUs. It occurs:

After all inference is complete and post-processing (filtering) has been applied
Before metric aggregation, which requires the complete dataset's results

On ranks other than 0, the gathered data is not needed, and those ranks simply wait at a barrier until rank 0 has finished processing.

Theoretical Basis

The gather operation follows a tree or direct pattern depending on the backend implementation:

Gather-to-root (rank 0):
  Each rank r sends data D_r to rank 0
  Rank 0 receives [D_0, D_1, ..., D_{W-1}]
  Rank 0 flattens: D_all = D_0 + D_1 + ... + D_{W-1}

Communication cost:
  - Data volume: sum(|D_r|) for r in {0, ..., W-1}
  - Latency: O(W) for direct gather, O(log W) for tree-based
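
As a rough worked example of the cost model above, the following sketch counts the data volume and the number of sequential steps for each pattern. The function name `gather_cost` and the shard sizes are assumptions chosen for illustration; real backends may behave differently.

```python
def gather_cost(shard_sizes):
    """Estimate gather cost for per-rank payload sizes |D_0|, ..., |D_{W-1}|."""
    world_size = len(shard_sizes)
    return {
        # Total data volume: sum of |D_r| over all ranks.
        "volume": sum(shard_sizes),
        # Direct gather: the root receives from the W-1 other ranks in turn.
        "direct_steps": max(world_size - 1, 0),
        # Tree gather: ceil(log2(W)) rounds, computed with integer arithmetic.
        "tree_rounds": (world_size - 1).bit_length() if world_size > 1 else 0,
    }


cost_4 = gather_cost([10, 20, 30, 40])
cost_8 = gather_cost([5] * 8)
```

With four ranks the direct pattern needs three steps and the tree pattern two rounds; with eight ranks the counts are seven and three, which is where the O(W) versus O(log W) difference starts to show.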

Gathering proceeds in three stages per task. Stage 1 issues a single gather, while stages 2 and 3 issue one gather per (metric, filter) pair:

For each task T:
  1. gather(logged_samples_r) -> logged_samples_all   (on rank 0)
  2. For each (metric, filter) in T.sample_metrics:
       gather(scores_r) -> scores_all                 (on rank 0)
  3. For each (metric, filter) in T.per_sample_metrics:
       gather(stability_r) -> stability_all           (on rank 0)
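
The per-task procedure can be simulated in a single process as follows. This is a sketch under stated assumptions: `gather_task_results`, the dictionary layout, and the metric and filter names ("acc", "consistency", "none") are hypothetical, and every rank is assumed to hold the same (metric, filter) keys so that all ranks would issue the same sequence of gathers.

```python
def gather_task_results(per_rank_results):
    """per_rank_results[r] holds rank r's local results for one task."""
    calls = 0

    def gather(values_per_rank):
        # Stand-in for one gather-to-root call followed by flattening.
        nonlocal calls
        calls += 1
        return [item for shard in values_per_rank for item in shard]

    # Stage 1: logged samples.
    merged = {"logged_samples": gather([r["logged_samples"] for r in per_rank_results])}
    # Stages 2 and 3: one gather per (metric, filter) key.
    for field in ("sample_metrics", "per_sample_metrics"):
        keys = sorted(per_rank_results[0][field])
        merged[field] = {k: gather([r[field][k] for r in per_rank_results]) for k in keys}
    return merged, calls


rank0 = {
    "logged_samples": [{"doc_id": 0}, {"doc_id": 2}],
    "sample_metrics": {("acc", "none"): [1.0, 0.0]},
    "per_sample_metrics": {("consistency", "none"): [0.5, 1.0]},
}
rank1 = {
    "logged_samples": [{"doc_id": 1}],
    "sample_metrics": {("acc", "none"): [1.0]},
    "per_sample_metrics": {("consistency", "none"): [1.0]},
}

merged, num_gathers = gather_task_results([rank0, rank1])
```

Here `num_gathers` is 1 plus the number of keys in each of the two metric dictionaries, so the number of collective calls grows with the number of (metric, filter) pairs rather than staying fixed.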

After all gathers complete, a barrier ensures every rank is synchronized before rank 0 proceeds to metric aggregation. This prevents rank 0 from starting aggregation while other ranks are still in the gather phase.
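
The role of the barrier can be illustrated with threads standing in for ranks. This is only an analogy built on Python's `threading.Barrier`; a real multi-process run would use the distributed backend's barrier instead, and the shared `gathered` list is a hypothetical substitute for the gather itself.

```python
import threading

WORLD_SIZE = 4
gathered = [None] * WORLD_SIZE          # slot r is filled by rank r
barrier = threading.Barrier(WORLD_SIZE)
aggregate = {}


def run_rank(rank):
    # "Gather" phase: each rank contributes its local scores.
    gathered[rank] = [rank * 10 + i for i in range(3)]
    # No rank passes this point until every rank has contributed.
    barrier.wait()
    if rank == 0:
        # Safe to aggregate: all slots are guaranteed to be filled.
        flat = [score for shard in gathered for score in shard]
        aggregate["mean"] = sum(flat) / len(flat)


threads = [threading.Thread(target=run_rank, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the `barrier.wait()` call, rank 0 could reach the aggregation step while some slots are still `None`.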

The total data transferred during gathering scales linearly with dataset size and with the number of metrics. It is typically small compared to the inference data volume, because only logged sample records and scores are gathered, not model weights or activations.

Related Pages

Implemented By

Implementation:EvolvingLMMs_Lab_Lmms_eval_Gather_Object
