Principle:EvolvingLMMs Lab Lmms eval Distributed Environment Setup
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Infrastructure |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Initializing a distributed process group is the foundational step required before any multi-GPU evaluation can proceed, establishing communication channels and rank identity for every participating process.
Description
In distributed computing, a process group is a logical grouping of processes that can communicate with one another through collective operations such as broadcast, gather, and barrier synchronization. Before evaluation workloads can be split across multiple GPUs, each process must join a process group, learn its own rank (unique integer identifier), and know the total world size (number of participating processes).
The initialization phase must accomplish three things:
- Backend selection -- Choose a communication backend (typically NCCL for GPU-to-GPU transfers or Gloo for CPU-based communication). The backend determines which collective operations are available and their performance characteristics.
- Rank assignment -- Each process receives a globally unique rank (0 through world_size-1) and a local rank within its node. The local rank is used for GPU device assignment, while the global rank governs data partitioning and result collection.
- Timeout configuration -- Distributed operations that involve synchronization (barriers, gathers, all-reduces) must have a timeout to avoid indefinite hangs when one process fails. For large-scale evaluation workloads that may involve slow model inference, a generous timeout is critical.
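The three steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual code: `DistConfig` and `read_dist_env` are hypothetical names, the environment variable names follow the torchrun convention, and the 60000-second default is only an assumed value. The resulting fields are the kind of values one would typically pass on to `torch.distributed.init_process_group(backend=..., timeout=...)`.

```python
import os
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DistConfig:
    """Hypothetical container for the values a process needs at startup."""
    backend: str
    rank: int
    local_rank: int
    world_size: int
    timeout: timedelta


def read_dist_env(env=None, use_gpu=True, timeout_s=60000):
    """Build a DistConfig from torchrun-style environment variables.

    Missing variables fall back to a single-process setup (rank 0 of 1).
    """
    env = os.environ if env is None else env
    return DistConfig(
        backend="nccl" if use_gpu else "gloo",      # backend selection
        rank=int(env.get("RANK", 0)),               # global rank
        local_rank=int(env.get("LOCAL_RANK", 0)),   # GPU index on this node
        world_size=int(env.get("WORLD_SIZE", 1)),   # total number of processes
        timeout=timedelta(seconds=timeout_s),       # timeout configuration
    )


# Example: the sixth process overall, second GPU on its node, 8 processes total.
cfg = read_dist_env({"RANK": "5", "LOCAL_RANK": "1", "WORLD_SIZE": "8"})
print(cfg.backend, cfg.rank, cfg.local_rank, cfg.world_size, cfg.timeout)
```

Keeping the parsing separate from the actual initialization call makes the startup logic easy to test without any GPUs present.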
There are two common approaches to launching distributed processes for evaluation:
- Accelerate launch -- The Hugging Face Accelerate library wraps PyTorch's distributed primitives and provides a higher-level Accelerator object. This approach handles device placement, mixed precision, and process group initialization in a unified API.
- Torchrun launch -- PyTorch's native torchrun utility spawns the worker processes and sets environment variables (LOCAL_RANK, RANK, WORLD_SIZE) for each one; the user script then reads these variables when it initializes the process group.
In either case, the result is the same: every process knows its rank, the world size, and has a working communication channel to all other processes.
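As a rough sketch of the two launch paths, the functions below show how a script might obtain its rank, world size, and device under each. `Accelerator`, `InitProcessGroupKwargs`, and the `torch.distributed` calls are existing library APIs, but the function names, the NCCL default, and the 60000-second timeout are illustrative assumptions rather than the project's exact code. Imports are deferred into the functions so the module loads even when only one of the two libraries is installed.

```python
import os
from datetime import timedelta

TIMEOUT = timedelta(seconds=60000)  # assumed value; tune for your workload


def setup_with_accelerate():
    """Let Accelerate create the process group and pick the device."""
    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    accelerator = Accelerator(
        kwargs_handlers=[InitProcessGroupKwargs(timeout=TIMEOUT)]
    )
    return accelerator.process_index, accelerator.num_processes, accelerator.device


def setup_with_torchrun(backend="nccl"):
    """Initialize the process group from the variables torchrun exports."""
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # bind this process to its GPU
    # The default env:// rendezvous reads RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    dist.init_process_group(backend=backend, timeout=TIMEOUT)
    return dist.get_rank(), dist.get_world_size(), torch.device("cuda", local_rank)
```

A script using the first function would be started with `accelerate launch`, and one using the second with `torchrun`; both return the same triple, which is what lets the rest of the evaluation code stay launcher-agnostic.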
Usage
Use distributed environment setup whenever:
- You have multiple GPUs available and want to parallelize evaluation across them
- The evaluation dataset is large enough that single-GPU evaluation is prohibitively slow
- You need to evaluate large models that require model parallelism (tensor or pipeline parallelism)
This principle applies at the very start of any distributed evaluation run, before data sharding, inference dispatch, or result gathering can occur.
Theoretical Basis
The distributed environment follows a Single Program Multiple Data (SPMD) paradigm. All processes execute the same evaluation script but operate on different data partitions:
Process Group Initialization:
    For each process p in {0, 1, ..., N-1}:
        1. p joins the process group with backend B
        2. p receives rank = p, world_size = N
        3. p sets its compute device to local_rank(p)
        4. p configures timeout T for all collective operations
Postcondition:
    - All N processes can communicate via collective ops
    - rank(p) is unique for each p
    - sum of all rank identities = N*(N-1)/2 (sanity check)
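The postcondition and the SPMD data split can be checked with a small single-process simulation. The helper names `check_ranks` and `shard` are hypothetical, and the strided split is one common partitioning scheme, not necessarily the one the evaluation harness uses. Note that the rank sum is only a cheap necessary condition, so the sketch follows it with an exact comparison.

```python
def check_ranks(ranks):
    """Return True if the reported ranks are exactly 0..N-1, each appearing once."""
    n = len(ranks)
    # Cheap necessary condition from the postcondition: sum equals N*(N-1)/2.
    if sum(ranks) != n * (n - 1) // 2:
        return False
    # Exact check: sorting must give 0, 1, ..., N-1.
    return sorted(ranks) == list(range(n))


def shard(items, rank, world_size):
    """Strided split: rank r takes items r, r + world_size, r + 2*world_size, ..."""
    return items[rank::world_size]


world_size = 4
samples = list(range(10))
# Simulate what each of the four ranks would see when running the same script.
shards = [shard(samples, r, world_size) for r in range(world_size)]

assert check_ranks([2, 0, 3, 1])
assert not check_ranks([0, 1, 1, 2])  # duplicate rank
# Every sample lands on exactly one rank.
assert sorted(x for s in shards for x in s) == samples
```

In a real run each process would call `shard` once with its own rank, and the ranks themselves would be collected with a gather before being validated.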
The timeout parameter T must exceed the longest time any rank may spend waiting inside a single collective operation for the other ranks to arrive. For evaluation, this wait is dominated by the slowest rank's model inference between synchronization points. A common choice is T = 60000 seconds (approximately 16.7 hours), which accommodates very large evaluation runs.
When every node runs the same number of processes (one per GPU), the global rank and world size follow from the node index, the number of GPUs per node, and the local rank:
global_rank = node_index * gpus_per_node + local_rank
world_size = num_nodes * gpus_per_node
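The mapping and its inverse can be written out directly. This is a small worked sketch assuming a homogeneous cluster with the same number of GPUs on every node; the function names are hypothetical.

```python
def global_rank(node_index, gpus_per_node, local_rank):
    """global_rank = node_index * gpus_per_node + local_rank."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be in [0, gpus_per_node)")
    return node_index * gpus_per_node + local_rank


def world_size(num_nodes, gpus_per_node):
    """world_size = num_nodes * gpus_per_node."""
    return num_nodes * gpus_per_node


def locate(rank, gpus_per_node):
    """Inverse mapping: return (node_index, local_rank) for a global rank."""
    return divmod(rank, gpus_per_node)


# Two nodes with 8 GPUs each: GPU 3 on the second node (node_index 1).
r = global_rank(node_index=1, gpus_per_node=8, local_rank=3)
assert r == 11
assert world_size(num_nodes=2, gpus_per_node=8) == 16
assert locate(r, gpus_per_node=8) == (1, 3)
```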