Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:EvolvingLMMs Lab Lmms eval Request Padding

From Leeroopedia
Revision as of 17:18, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/EvolvingLMMs_Lab_Lmms_eval_Request_Padding.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Knowledge Sources
Domains Distributed_Computing, Evaluation
Last Updated 2026-02-14 00:00 GMT

Overview

Equalizing the number of inference requests across distributed ranks is necessary to prevent deadlocks during synchronized collective operations such as barriers and all-gathers.

Description

When evaluation data is sharded across multiple processes using round-robin distribution, ranks may end up with different numbers of evaluation instances. This imbalance arises from two sources:

  1. Uneven dataset size -- When the total number of documents D is not evenly divisible by the world size W, some ranks receive one more document than others (specifically, the first D mod W ranks).
  2. Variable requests per document -- Different documents may generate different numbers of model requests. For example, a multiple-choice question with 4 answer options generates 4 loglikelihood requests, while a question with 5 options generates 5.

This imbalance creates a critical problem for distributed training and evaluation frameworks that use synchronized collective operations. Frameworks like FSDP (Fully Sharded Data Parallel) and DDP (Distributed Data Parallel) insert synchronization barriers (such as gradient all-reduces or model-parallel communication) within each forward pass. If rank A has 100 forward passes and rank B has 102, then after rank A completes its last forward pass it will wait at the next barrier, but rank B still has 2 more forward passes to execute. Since rank B needs rank A to participate in the barrier during those 2 forward passes, both ranks deadlock.

The solution is request padding: after counting the number of instances on each rank, pad the shorter lists by duplicating the last request so that all ranks have the same number of forward passes. The results from these padded (duplicate) requests are simply discarded after inference.

Usage

Request padding is required whenever:

  • The evaluation uses more than one GPU (world_size > 1)
  • The model uses FSDP, DDP, or any other parallelism strategy that synchronizes across ranks during forward passes
  • Different ranks have different numbers of evaluation instances (which is the common case)

Padding is computed after data sharding and request construction, but before inference dispatch.

Theoretical Basis

Let n_r be the number of instances assigned to rank r for a given request type. The padding amount for rank r is:

pad(r) = max(n_0, n_1, ..., n_{W-1}) - n_r

After padding, every rank has exactly max(n_0, ..., n_{W-1}) instances, ensuring all ranks execute the same number of forward passes.

The padding process requires a collective communication step to share instance counts:

1. Each rank r computes n_r = len(task._instances)
2. All ranks perform all_gather on n_r to obtain [n_0, n_1, ..., n_{W-1}]
3. Each rank r computes pad(r) = max([n_0, ..., n_{W-1}]) - n_r
4. Rank r appends pad(r) copies of its last request to its request list

The total number of wasted (duplicate) forward passes across all ranks is:

total_waste = W * max(n_r) - sum(n_r)

For well-balanced sharding where max(n_r) - min(n_r) <= 1, this waste is at most W - 1 forward passes total, which is negligible.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment