Workflow: EvolvingLMMs-Lab lmms-eval Distributed Multi-GPU Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Multimodal_Evaluation, Distributed_Computing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
End-to-end process for running lmms-eval evaluations across multiple GPUs using HuggingFace Accelerate or torchrun, enabling parallel data processing for large-scale benchmarking.
Description
This workflow covers distributed evaluation of multimodal models across multiple GPUs. The framework supports two distributed backends: HuggingFace Accelerate (default) and PyTorch torchrun. In distributed mode, evaluation data is automatically sharded across ranks (each GPU processes a subset of the dataset), requests are padded to ensure equal work distribution, and results are gathered to rank 0 for final aggregation. This enables evaluation of large benchmarks that would be prohibitively slow on a single GPU, as well as evaluation of models that require tensor parallelism across multiple GPUs.
Usage
Execute this workflow when single-GPU evaluation is too slow for your benchmarking needs, when the model requires more VRAM than a single GPU provides (tensor parallelism), or when evaluating against large-scale benchmarks with thousands of samples. Distributed evaluation is also essential for CI/CD pipelines with time constraints.
Execution Steps
Step 1: Distributed Environment Setup
Configure the multi-GPU environment for distributed evaluation. The framework supports two launch mechanisms: HuggingFace Accelerate (via accelerate launch) and PyTorch torchrun. Both set environment variables (RANK, LOCAL_RANK, WORLD_SIZE) that the evaluator uses for work distribution. The Accelerator is initialized with a configurable process group timeout to handle long-running evaluations.
Key considerations:
- Accelerate launch: `accelerate launch --num_processes N -m lmms_eval ...`
- Torchrun launch: `torchrun --nproc_per_node N -m lmms_eval ...`
- NCCL is the default communication backend for GPU-to-GPU communication
- Set a generous timeout (default: 60000 seconds) for evaluations with large models or datasets
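Both launchers communicate the process topology through environment variables, which the evaluator reads at startup. The following is a minimal sketch of that handshake using only the standard library; the function name `distributed_context` is illustrative, not part of the lmms-eval API.

```python
import os

def distributed_context():
    """Read the process-topology variables exported by `accelerate launch`
    or torchrun. In a plain single-process run they are absent, so we fall
    back to a single-GPU (rank 0, world size 1) configuration."""
    rank = int(os.environ.get("RANK", 0))            # global rank across all nodes
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index on this node
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size
```

With Accelerate, the long process-group timeout mentioned above is typically passed via `InitProcessGroupKwargs(timeout=...)` when constructing the `Accelerator`.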
Step 2: Data Sharding
The evaluator automatically distributes evaluation data across ranks. Each GPU rank processes a disjoint subset of the dataset, determined by its rank index and the total world size. The create_iterator utility partitions documents evenly, with any remainder distributed to lower-ranked processes. The --limit and --offset parameters are applied per-rank, meaning the effective evaluation set is limit * world_size total samples.
Key considerations:
- Data sharding is transparent; no user configuration is needed
- Each rank loads the full dataset but only iterates over its assigned partition
- The --limit parameter applies per rank (total evaluated = limit * world_size)
- Few-shot examples are sampled identically across ranks for consistency
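The round-robin partitioning described above can be sketched with `itertools.islice`, in the spirit of the `create_iterator` utility (the exact signature here is an assumption):

```python
from itertools import islice

def create_iterator(raw_iterable, rank, world_size, limit=None):
    """Round-robin shard: rank r yields items r, r + world_size,
    r + 2 * world_size, ... `limit` bounds the stop index into the
    raw iterable. A sketch of the partitioning scheme only."""
    return islice(raw_iterable, rank, limit, world_size)

docs = list(range(10))
shards = [list(create_iterator(docs, r, 4)) for r in range(4)]
# ranks 0 and 1 receive 3 documents each; ranks 2 and 3 receive 2,
# so the remainder lands on the lower-ranked processes.
```

Because every rank iterates the same underlying sequence with a different offset, the shards are disjoint and together cover the full dataset with no coordination needed between ranks.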
Step 3: Request Padding
Ensure an equal workload across ranks by padding request queues. Since DDP/FSDP execution requires synchronized forward passes across ranks, the evaluator counts instances per rank and pads shorter queues with dummy requests. The padding count is computed per request type (generate_until, loglikelihood) from the maximum instance count across all ranks. This prevents deadlocks during distributed inference.
Key considerations:
- Padding is computed separately for each request type
- Padding requests use the last real request as a template
- Results from padding requests are discarded during aggregation
- Without padding, distributed inference would hang at synchronization barriers
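The padding logic amounts to a small calculation per rank; here is a simplified sketch (function and variable names are illustrative):

```python
def pad_requests(per_rank_counts, my_rank, my_requests):
    """Pad this rank's queue so every rank executes the same number of
    forward passes. Padding duplicates the last real request; results
    produced by padding are discarded during aggregation."""
    target = max(per_rank_counts)          # heaviest rank sets the pace
    n_pad = target - per_rank_counts[my_rank]
    return my_requests + [my_requests[-1]] * n_pad, n_pad

padded, n_pad = pad_requests([5, 3], my_rank=1, my_requests=["q1", "q2", "q3"])
# rank 1 now runs 5 requests, the last 2 being padding copies of "q3"
```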
Step 4: Parallel Inference
Execute model inference in parallel across all GPU ranks. Each rank independently runs its assigned requests through the model. Synchronization barriers (accelerator.wait_for_everyone() or dist.barrier()) ensure all ranks complete inference before proceeding to post-processing. For tensor-parallel models, the model itself is split across GPUs and all ranks process the same requests cooperatively.
Key considerations:
- Data parallelism: each rank runs the full model on different data
- Tensor parallelism: model layers are split across ranks for large models
- Barriers after inference ensure all ranks complete before gathering results
- GPU memory is freed after inference to allow post-processing (e.g., LLM-as-judge)
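The role of the post-inference barrier can be illustrated with a pure-Python simulation, using a `threading.Barrier` to stand in for `accelerator.wait_for_everyone()` / `dist.barrier()` (the "inference" here is a mock; real ranks are separate processes communicating over NCCL):

```python
import threading

def run_rank(rank, shard, barrier, results):
    """Simulate one data-parallel rank: run inference on its shard,
    then block at the barrier until every rank has finished."""
    results[rank] = [f"answer-to-{doc}" for doc in shard]  # mock inference
    barrier.wait()  # no rank proceeds to result gathering until all are done

world_size = 4
shards = [[f"doc{r}-{i}" for i in range(2)] for r in range(world_size)]
barrier = threading.Barrier(world_size)
results = [None] * world_size
threads = [threading.Thread(target=run_rank, args=(r, shards[r], barrier, results))
           for r in range(world_size)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The barrier guarantees that when gathering begins, every rank's result list is complete; without it, a fast rank could start collective communication while a slow rank is still mid-inference.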
Step 5: Result Gathering
Collect per-rank results to rank 0 for final metric computation. Using torch.distributed.gather_object(), each rank sends its logged samples and per-metric score lists to rank 0. The gathered results are flattened (concatenated across ranks) to reconstruct the full evaluation dataset's results. Per-sample stability metrics are also gathered when using k-samples mode.
Key considerations:
- Only rank 0 computes final aggregated metrics
- Non-rank-0 processes return None from the evaluation
- A final dist.barrier() ensures all processes synchronize before cleanup
- Result gathering handles both logged samples and metric score lists
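Conceptually, gathering flattens per-rank result lists in rank order and drops the padding entries. The sketch below simulates this in pure Python (real gathering uses `torch.distributed.gather_object()` over NCCL; the function name is illustrative):

```python
def gather_to_rank0(per_rank_samples, per_rank_pad_counts):
    """Simulate result gathering on rank 0: receive each rank's list,
    strip its padding results, and concatenate rank by rank to
    reconstruct the full dataset's results."""
    gathered = []
    for samples, n_pad in zip(per_rank_samples, per_rank_pad_counts):
        real = samples[:len(samples) - n_pad] if n_pad else samples
        gathered.extend(real)
    return gathered

full = gather_to_rank0([[1, 2, 3], [4, 5, 5]], [0, 1])
# → [1, 2, 3, 4, 5]: rank 1's single padding result is discarded
```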
Step 6: Metric Aggregation
On rank 0, aggregate all gathered results into final evaluation metrics. This follows the same aggregation logic as single-GPU evaluation: bootstrap confidence intervals are computed, group-level metrics are consolidated from subtask results, and the results dictionary is assembled with configuration metadata, version information, and per-task scores. The formatted results table is printed and saved.
Key considerations:
- Aggregation runs only on rank 0 (global_rank == 0)
- Bootstrap iterations default to 100,000 for robust confidence intervals
- Group metrics are computed from constituent subtask results
- W&B logging and HuggingFace Hub pushing occur only on rank 0
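The bootstrap step can be sketched as resampling the gathered per-sample scores with replacement and measuring the spread of the resampled means (a simplified stand-in for the framework's aggregation code, which defaults to 100,000 iterations; the function name is illustrative):

```python
import random
import statistics

def bootstrap_stderr(scores, iters=1000, seed=0):
    """Estimate the standard error of the mean score: resample the score
    list with replacement `iters` times and take the standard deviation
    of the resampled means."""
    rng = random.Random(seed)  # seeded for reproducible confidence intervals
    means = [statistics.fmean(rng.choices(scores, k=len(scores)))
             for _ in range(iters)]
    return statistics.stdev(means)

scores = [1, 0, 1, 1, 0, 1, 0, 1]  # e.g., per-sample accuracy from all ranks
stderr = bootstrap_stderr(scores)
```

Because aggregation runs only on rank 0 after gathering, the bootstrap sees the complete score list, so the confidence interval reflects the full evaluation set rather than any single rank's shard.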