Workflow: EvolvingLMMs-Lab lmms-eval Distributed Multi-GPU Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Multimodal_Evaluation, Distributed_Computing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
End-to-end process for running lmms-eval evaluations across multiple GPUs using HuggingFace Accelerate or torchrun, enabling parallel data processing for large-scale benchmarking.
Description
This workflow covers distributed evaluation of multimodal models across multiple GPUs. The framework supports two distributed backends: HuggingFace Accelerate (default) and PyTorch torchrun. In distributed mode, evaluation data is automatically sharded across ranks (each GPU processes a subset of the dataset), requests are padded to ensure equal work distribution, and results are gathered to rank 0 for final aggregation. This enables evaluation of large benchmarks that would be prohibitively slow on a single GPU, as well as evaluation of models that require tensor parallelism across multiple GPUs.
Usage
Execute this workflow when single-GPU evaluation is too slow for your benchmarking needs, when the model requires more VRAM than a single GPU provides (tensor parallelism), or when evaluating against large-scale benchmarks with thousands of samples. Distributed evaluation is also essential for CI/CD pipelines with time constraints.
Execution Steps
Step 1: Distributed Environment Setup
Configure the multi-GPU environment for distributed evaluation. The framework supports two launch mechanisms: HuggingFace Accelerate (via accelerate launch) and PyTorch torchrun. Both set environment variables (RANK, LOCAL_RANK, WORLD_SIZE) that the evaluator uses for work distribution. The Accelerator is initialized with a configurable process group timeout to handle long-running evaluations.
Key considerations:
- Accelerate launch: `accelerate launch --num_processes N -m lmms_eval ...`
- Torchrun launch: `torchrun --nproc_per_node N -m lmms_eval ...`
- NCCL is the default communication backend for GPU-to-GPU communication
- Set a generous timeout (default: 60000 seconds) for evaluations with large models or datasets
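Both launchers communicate the process topology through environment variables, which the evaluator reads at startup. The following is a minimal sketch of that handshake using only the standard library; the function name `distributed_context` is illustrative, not part of the lmms-eval API.

```python
import os

def distributed_context():
    """Read the process-topology variables exported by `accelerate launch`
    or torchrun. In a plain single-process run they are absent, so we fall
    back to a single-GPU (rank 0, world size 1) configuration."""
    rank = int(os.environ.get("RANK", 0))            # global rank across all nodes
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index on this node
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size
```

With Accelerate, the long process-group timeout mentioned above is typically passed via `InitProcessGroupKwargs(timeout=...)` when constructing the `Accelerator`.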
Step 2: Data Sharding
The evaluator automatically distributes evaluation data across ranks. Each GPU rank processes a disjoint subset of the dataset, determined by its rank index and the total world size. The create_iterator utility partitions documents evenly, with any remainder distributed to lower-ranked processes. The --limit and --offset parameters are applied per-rank, meaning the effective evaluation set is limit * world_size total samples.
Key considerations:
- Data sharding is transparent; no user configuration is needed
- Each rank loads the full dataset but only iterates over its assigned partition
- The --limit parameter applies per rank (total evaluated = limit * world_size)
- Few-shot examples are sampled identically across ranks for consistency
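The round-robin partitioning described above can be sketched with `itertools.islice`, in the spirit of the `create_iterator` utility (the exact signature here is an assumption):

```python
from itertools import islice

def create_iterator(raw_iterable, rank, world_size, limit=None):
    """Round-robin shard: rank r yields items r, r + world_size,
    r + 2 * world_size, ... `limit` bounds the stop index into the
    raw iterable. A sketch of the partitioning scheme only."""
    return islice(raw_iterable, rank, limit, world_size)

docs = list(range(10))
shards = [list(create_iterator(docs, r, 4)) for r in range(4)]
# ranks 0 and 1 receive 3 documents each; ranks 2 and 3 receive 2,
# so the remainder lands on the lower-ranked processes.
```

Because every rank iterates the same underlying sequence with a different offset, the shards are disjoint and together cover the full dataset with no coordination needed between ranks.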
Step 3: Request Padding
Ensure an equal workload across ranks by padding request queues. Since DDP/FSDP execution requires synchronized forward passes across ranks, the evaluator counts instances per rank and pads shorter queues with dummy requests. The padding count is computed per request type (generate_until, loglikelihood) from the maximum instance count across all ranks. This prevents deadlocks during distributed inference.
Key considerations:
- Padding is computed separately for each request type
- Padding requests use the last real request as a template
- Results from padding requests are discarded during aggregation
- Without padding, distributed inference would hang at synchronization barriers
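The padding logic amounts to a small calculation per rank; here is a simplified sketch (function and variable names are illustrative):

```python
def pad_requests(per_rank_counts, my_rank, my_requests):
    """Pad this rank's queue so every rank executes the same number of
    forward passes. Padding duplicates the last real request; results
    produced by padding are discarded during aggregation."""
    target = max(per_rank_counts)          # heaviest rank sets the pace
    n_pad = target - per_rank_counts[my_rank]
    return my_requests + [my_requests[-1]] * n_pad, n_pad

padded, n_pad = pad_requests([5, 3], my_rank=1, my_requests=["q1", "q2", "q3"])
# rank 1 now runs 5 requests, the last 2 being padding copies of "q3"
```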
Step 4: Parallel Inference
Execute model inference in parallel across all GPU ranks. Each rank independently runs its assigned requests through the model. Synchronization barriers (accelerator.wait_for_everyone() or dist.barrier()) ensure all ranks complete inference before proceeding to post-processing. For tensor-parallel models, the model itself is split across GPUs and all ranks process the same requests cooperatively.
Key considerations:
- Data parallelism: each rank runs the full model on different data
- Tensor parallelism: model layers are split across ranks for large models
- Barriers after inference ensure all ranks complete before gathering results
- GPU memory is freed after inference to allow post-processing (e.g., LLM-as-judge)
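The role of the post-inference barrier can be illustrated with a pure-Python simulation, using a `threading.Barrier` to stand in for `accelerator.wait_for_everyone()` / `dist.barrier()` (the "inference" here is a mock; real ranks are separate processes communicating over NCCL):

```python
import threading

def run_rank(rank, shard, barrier, results):
    """Simulate one data-parallel rank: run inference on its shard,
    then block at the barrier until every rank has finished."""
    results[rank] = [f"answer-to-{doc}" for doc in shard]  # mock inference
    barrier.wait()  # no rank proceeds to result gathering until all are done

world_size = 4
shards = [[f"doc{r}-{i}" for i in range(2)] for r in range(world_size)]
barrier = threading.Barrier(world_size)
results = [None] * world_size
threads = [threading.Thread(target=run_rank, args=(r, shards[r], barrier, results))
           for r in range(world_size)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The barrier guarantees that when gathering begins, every rank's result list is complete; without it, a fast rank could start collective communication while a slow rank is still mid-inference.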
Step 5: Result Gathering
Collect per-rank results to rank 0 for final metric computation. Using torch.distributed.gather_object(), each rank sends its logged samples and per-metric score lists to rank 0. The gathered results are flattened (concatenated across ranks) to reconstruct the full evaluation dataset's results. Per-sample stability metrics are also gathered when using k-samples mode.
Key considerations:
- Only rank 0 computes final aggregated metrics
- Non-rank-0 processes return None from the evaluation
- A final dist.barrier() ensures all processes synchronize before cleanup
- Result gathering handles both logged samples and metric score lists
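Conceptually, gathering flattens per-rank result lists in rank order and drops the padding entries. The sketch below simulates this in pure Python (real gathering uses `torch.distributed.gather_object()` over NCCL; the function name is illustrative):

```python
def gather_to_rank0(per_rank_samples, per_rank_pad_counts):
    """Simulate result gathering on rank 0: receive each rank's list,
    strip its padding results, and concatenate rank by rank to
    reconstruct the full dataset's results."""
    gathered = []
    for samples, n_pad in zip(per_rank_samples, per_rank_pad_counts):
        real = samples[:len(samples) - n_pad] if n_pad else samples
        gathered.extend(real)
    return gathered

full = gather_to_rank0([[1, 2, 3], [4, 5, 5]], [0, 1])
# → [1, 2, 3, 4, 5]: rank 1's single padding result is discarded
```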
Step 6: Metric Aggregation
On rank 0, aggregate all gathered results into final evaluation metrics. This follows the same aggregation logic as single-GPU evaluation: bootstrap confidence intervals are computed, group-level metrics are consolidated from subtask results, and the results dictionary is assembled with configuration metadata, version information, and per-task scores. The formatted results table is printed and saved.
Key considerations:
- Aggregation runs only on rank 0 (global_rank == 0)
- Bootstrap iterations default to 100,000 for robust confidence intervals
- Group metrics are computed from constituent subtask results
- W&B logging and HuggingFace Hub pushing occur only on rank 0
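The bootstrap step can be sketched as resampling the gathered per-sample scores with replacement and measuring the spread of the resampled means (a simplified stand-in for the framework's aggregation code, which defaults to 100,000 iterations; the function name is illustrative):

```python
import random
import statistics

def bootstrap_stderr(scores, iters=1000, seed=0):
    """Estimate the standard error of the mean score: resample the score
    list with replacement `iters` times and take the standard deviation
    of the resampled means."""
    rng = random.Random(seed)  # seeded for reproducible confidence intervals
    means = [statistics.fmean(rng.choices(scores, k=len(scores)))
             for _ in range(iters)]
    return statistics.stdev(means)

scores = [1, 0, 1, 1, 0, 1, 0, 1]  # e.g., per-sample accuracy from all ranks
stderr = bootstrap_stderr(scores)
```

Because aggregation runs only on rank 0 after gathering, the bootstrap sees the complete score list, so the confidence interval reflects the full evaluation set rather than any single rank's shard.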