Principle:EvolvingLMMs Lab Lmms eval Result Gathering

From Leeroopedia
Revision as of 17:43, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/EvolvingLMMs_Lab_Lmms_eval_Result_Gathering.md)
Knowledge Sources
Domains Distributed_Computing, Data_Processing
Last Updated 2026-02-14 00:00 GMT

Overview

Per-rank evaluation results must be collected on a single coordinator rank to reconstruct the complete evaluation output for a dataset that was sharded across multiple processes.

Description

After parallel inference completes, each rank holds evaluation results for only its own shard of the data. To compute dataset-level metrics, all per-rank results must be collected onto a single rank (conventionally rank 0, the "coordinator"). This process is known as result gathering.

Result gathering involves three distinct types of data:

  1. Logged samples -- Detailed per-document information including model inputs, outputs, and reference answers. These are used for qualitative analysis and debugging.
  2. Per-metric score lists -- Numerical scores for each document under each metric (e.g., accuracy, BLEU score). These are the raw data from which aggregate metrics are computed.
  3. Per-sample stability metrics -- When running multiple evaluation passes (k-samples mode), per-sample consistency and variance metrics are collected for statistical analysis.

Each of these is gathered separately using a gather-to-root collective operation, where all ranks send their local data to rank 0. On rank 0, the gathered data from all ranks is flattened into a single list, reconstructing the complete dataset's results.
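The flattening step on rank 0 can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `flatten_gathered`, the record fields, and the round-robin sharding are hypothetical, and the transport itself is omitted (in a PyTorch setup a collective such as `torch.distributed.gather_object` could fill that role, which is an assumption here).

```python
from itertools import chain


def flatten_gathered(per_rank_lists):
    """Concatenate per-rank result lists in rank order (hypothetical helper)."""
    return list(chain.from_iterable(per_rank_lists))


# Hypothetical data as rank 0 would see it after a gather with world size 3.
# Index i of the outer list holds the shard that rank i produced.
gathered = [
    [{"doc_id": 0, "acc": 1.0}, {"doc_id": 3, "acc": 0.0}],  # from rank 0
    [{"doc_id": 1, "acc": 1.0}],                              # from rank 1
    [{"doc_id": 2, "acc": 0.0}],                              # from rank 2
]

all_results = flatten_gathered(gathered)

# Concatenation keeps rank order, not document order, so sort by document id
# if downstream code expects the original dataset order.
all_results_sorted = sorted(all_results, key=lambda r: r["doc_id"])
```

Shards may have different lengths when the dataset size is not a multiple of the world size, which is why the gathered structure is a list of lists rather than a fixed-shape array.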

Usage

Result gathering is required whenever evaluation is distributed across multiple processes (typically one per GPU). It occurs:

  • After all inference is complete and post-processing (filtering) has been applied
  • Before metric aggregation, which requires the complete dataset's results

On ranks other than 0, the gathered data is not needed, and those ranks simply wait at a barrier until rank 0 has finished processing.

Theoretical Basis

The gather operation follows a tree or direct pattern depending on the backend implementation:

Gather-to-root (rank 0):
  Each rank r sends data D_r to rank 0
  Rank 0 receives [D_0, D_1, ..., D_{W-1}]
  Rank 0 flattens: D_all = D_0 + D_1 + ... + D_{W-1}

Communication cost:
  - Data volume: sum(|D_r|) for r in {0, ..., W-1}
  - Latency: O(W) for direct gather, O(log W) for tree-based
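The cost model above can be made concrete with a small calculator. This is a rough sketch under simplifying assumptions: `gather_cost` is a hypothetical helper, the direct pattern is counted as one sequential receive per non-root rank, and the tree pattern is counted as ceil(log2 W) rounds. Real backends may behave differently.

```python
import math


def gather_cost(shard_sizes, pattern="direct"):
    """Return (total data volume, sequential communication steps).

    shard_sizes[r] is |D_r|, the size of rank r's local data.
    """
    world_size = len(shard_sizes)
    volume = sum(shard_sizes)  # sum(|D_r|) for r in {0, ..., W-1}
    if pattern == "direct":
        # Root receives from each of the other W-1 ranks in turn: O(W).
        steps = max(world_size - 1, 0)
    elif pattern == "tree":
        # Number of ranks holding data halves each round: O(log W).
        steps = math.ceil(math.log2(world_size)) if world_size > 1 else 0
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    return volume, steps


direct = gather_cost([10] * 8, pattern="direct")
tree = gather_cost([10] * 8, pattern="tree")
```

With eight ranks holding ten items each, both patterns move the same total volume, but the step count differs (seven for direct, three for tree under this model).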

The gather is performed in three stages per task; stages 2 and 3 issue one gather per (metric, filter) pair:

For each task T:
  1. gather(logged_samples_r) -> logged_samples_all   (on rank 0)
  2. For each (metric, filter) in T.sample_metrics:
       gather(scores_r) -> scores_all                 (on rank 0)
  3. For each (metric, filter) in T.per_sample_metrics:
       gather(stability_r) -> stability_all           (on rank 0)
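The per-task loop above can be sketched as follows. This is a single-process simulation, not the actual evaluator code: `gather_to_root` stands in for the collective by taking all ranks' values at once, and the dictionary layout, the `gather_task` helper, and the metric names are hypothetical.

```python
from itertools import chain


def gather_to_root(per_rank_values):
    """Stand-in for a gather-to-root collective.

    Takes one list per rank and returns the flattened list rank 0 would hold.
    """
    return list(chain.from_iterable(per_rank_values))


def gather_task(task_shards):
    """Merge one task's per-rank results; task_shards[r] belongs to rank r."""
    merged = {
        # Stage 1: logged samples.
        "logged_samples": gather_to_root(
            [shard["logged_samples"] for shard in task_shards]
        ),
        "sample_metrics": {},
        "per_sample_metrics": {},
    }
    # Stages 2 and 3: one gather per (metric, filter) key in each mapping.
    for field in ("sample_metrics", "per_sample_metrics"):
        for key in task_shards[0][field]:
            merged[field][key] = gather_to_root(
                [shard[field][key] for shard in task_shards]
            )
    return merged


# Hypothetical shards for a world size of 2.
rank0 = {
    "logged_samples": [{"doc_id": 0}],
    "sample_metrics": {("acc", "none"): [1.0]},
    "per_sample_metrics": {("consistency", "none"): [0.5]},
}
rank1 = {
    "logged_samples": [{"doc_id": 1}],
    "sample_metrics": {("acc", "none"): [0.0]},
    "per_sample_metrics": {("consistency", "none"): [1.0]},
}
merged = gather_task([rank0, rank1])
```

The sketch assumes every rank holds the same set of (metric, filter) keys, so that all ranks enter the same sequence of gathers; a mismatch would leave a real collective waiting indefinitely.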

After all gathers complete, a barrier ensures every rank is synchronized before rank 0 proceeds to metric aggregation. This prevents rank 0 from starting aggregation while other ranks are still in the gather phase.
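The ordering guarantee of the barrier can be illustrated with threads standing in for ranks. This is only an analogy under stated assumptions: `threading.Barrier` replaces the distributed barrier, the shared `inbox` list replaces rank 0's receive buffer, and all names are hypothetical.

```python
import threading

WORLD_SIZE = 4
barrier = threading.Barrier(WORLD_SIZE)
inbox = [None] * WORLD_SIZE  # stand-in for rank 0's gather buffer
events = []                  # ordered log of what happened
lock = threading.Lock()


def worker(rank):
    # Gather phase: each simulated rank deposits its local shard.
    inbox[rank] = [rank * 10, rank * 10 + 1]
    with lock:
        events.append(("gathered", rank))

    # No thread passes this point until all WORLD_SIZE threads have arrived.
    barrier.wait()

    # Only the coordinator aggregates, and only after the barrier.
    if rank == 0:
        flat = [x for shard in inbox for x in shard]
        with lock:
            events.append(("aggregated", len(flat)))


threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every "gathered" event is logged before its thread reaches the barrier, the "aggregated" event is always the last entry in `events`, regardless of thread scheduling.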

The total data transferred during gathering scales linearly with dataset size and the number of metrics. It is typically small compared to the data volume handled during inference, since only per-document records and scores are gathered (not model weights or activations).

Related Pages

Implemented By
