Principle:EvolvingLMMs Lab Lmms eval Request Padding
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Evaluation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Equalizing the number of inference requests across distributed ranks is necessary to prevent deadlocks during synchronized collective operations such as barriers and all-gathers.
Description
When evaluation data is sharded across multiple processes using round-robin distribution, ranks may end up with different numbers of evaluation instances. This imbalance arises from two sources:
- Uneven dataset size -- When the total number of documents D is not evenly divisible by the world size W, some ranks receive one more document than others (specifically, the first
D mod Wranks). - Variable requests per document -- Different documents may generate different numbers of model requests. For example, a multiple-choice question with 4 answer options generates 4 loglikelihood requests, while a question with 5 options generates 5.
This imbalance creates a critical problem for distributed training and evaluation frameworks that use synchronized collective operations. Frameworks like FSDP (Fully Sharded Data Parallel) and DDP (Distributed Data Parallel) insert synchronization barriers (such as gradient all-reduces or model-parallel communication) within each forward pass. If rank A has 100 forward passes and rank B has 102, then after rank A completes its last forward pass it will wait at the next barrier, but rank B still has 2 more forward passes to execute. Since rank B needs rank A to participate in the barrier during those 2 forward passes, both ranks deadlock.
The solution is request padding: after counting the number of instances on each rank, pad the shorter lists by duplicating the last request so that all ranks have the same number of forward passes. The results from these padded (duplicate) requests are simply discarded after inference.
Usage
Request padding is required whenever:
- The evaluation uses more than one GPU (
world_size > 1) - The model uses FSDP, DDP, or any other parallelism strategy that synchronizes across ranks during forward passes
- Different ranks have different numbers of evaluation instances (which is the common case)
Padding is computed after data sharding and request construction, but before inference dispatch.
Theoretical Basis
Let n_r be the number of instances assigned to rank r for a given request type. The padding amount for rank r is:
pad(r) = max(n_0, n_1, ..., n_{W-1}) - n_r
After padding, every rank has exactly max(n_0, ..., n_{W-1}) instances, ensuring all ranks execute the same number of forward passes.
The padding process requires a collective communication step to share instance counts:
1. Each rank r computes n_r = len(task._instances)
2. All ranks perform all_gather on n_r to obtain [n_0, n_1, ..., n_{W-1}]
3. Each rank r computes pad(r) = max([n_0, ..., n_{W-1}]) - n_r
4. Rank r appends pad(r) copies of its last request to its request list
The total number of wasted (duplicate) forward passes across all ranks is:
total_waste = W * max(n_r) - sum(n_r)
For well-balanced sharding where max(n_r) - min(n_r) <= 1, this waste is at most W - 1 forward passes total, which is negligible.