Principle:EvolvingLMMs Lab Lmms eval Data Sharding
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Data_Processing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Partitioning evaluation data across distributed processes ensures that each process evaluates a disjoint subset of the dataset, maximizing throughput while guaranteeing full dataset coverage.
Description
In distributed evaluation, the complete dataset must be divided among all participating processes so that:
- Every document is evaluated exactly once -- No document is missed and no document is processed by multiple ranks.
- The load is approximately balanced -- Each rank receives roughly the same number of documents to minimize idle time.
- The assignment is deterministic -- Given the same rank, world_size, and dataset ordering, the same partition is produced every time.
Interleaved round-robin sharding is a simple and effective strategy that satisfies all three requirements. Rather than assigning contiguous blocks of documents to each rank (which can cause load imbalance if document difficulty varies systematically), round-robin distributes documents cyclically:
- Rank 0 gets documents at indices 0, world_size, 2*world_size, ...
- Rank 1 gets documents at indices 1, world_size+1, 2*world_size+1, ...
- Rank k gets documents at indices k, world_size+k, 2*world_size+k, ...
This interleaving ensures that any systematic variation in document characteristics (such as image resolution or text length varying with position) is spread evenly across ranks.
Usage
Use data sharding whenever running evaluation across multiple GPUs or nodes. The sharding is applied during the request-building phase -- before any model inference occurs -- so each rank builds evaluation instances only for its assigned documents. This approach also supports:
- Limiting evaluation -- A
limitparameter controls how many total documents to evaluate (useful for debugging or quick checks). - Offsetting -- An
offsetparameter skips the first N documents before sharding begins, enabling evaluation of specific dataset segments.
Theoretical Basis
Given a dataset of D documents indexed 0 through D-1, with W processes (world_size) and rank r:
Shard(r, W, D) = { i : i in [0, D), i mod W == r }
Equivalently, using Python's itertools.islice:
start = r
stop = D (or limit if specified)
step = W
Shard = islice(range(D), start, stop, step)
The number of documents assigned to rank r is:
|Shard(r, W, D)| = floor(D / W) + (1 if r < D mod W else 0)
This means the maximum imbalance between any two ranks is exactly 1 document. When D mod W != 0, the lower-numbered ranks each receive one extra document.
When an offset O is applied, the sharding formula becomes:
start = r + O
stop = O + limit (or None if no limit)
step = W
The offset shifts the starting position uniformly for all ranks, preserving the round-robin interleaving pattern.