Environment:EvolvingLMMs Lab Lmms eval GPU Compute Environment

Knowledge Sources	lmms-eval PyTorch Distributed
Domains	Infrastructure, GPU_Computing, Distributed_Training
Last Updated	2026-02-14 00:00 GMT

Overview

An NVIDIA GPU environment with CUDA support, distributed execution via Accelerate or torchrun, and configurable device mapping for multi-GPU evaluation.

Description

This environment defines the GPU hardware and distributed computing requirements for running lmms-eval at production scale. It supports single-GPU, multi-GPU (data parallel), and multi-node distributed evaluation using either HuggingFace Accelerate or PyTorch torchrun backends. The framework relies on CUDA for GPU acceleration, uses SDPA (Scaled Dot-Product Attention) for memory-efficient inference on large models, and supports configurable attention implementations including flash_attention_2, sdpa, and eager modes. Default GPU memory utilization is set to 80% for vLLM/SGLang backends.

Usage

Use this environment for GPU-accelerated model evaluation, particularly when running large multimodal models (7B+ parameters). It is a mandatory prerequisite for the Distributed Multi-GPU Evaluation workflow and is strongly recommended for the End-to-End Evaluation workflow with models larger than 3B parameters.

System Requirements

Category	Requirement	Notes
GPU	NVIDIA GPU with CUDA support	Compute capability 7.0+ recommended (Volta or newer)
VRAM	16GB minimum	80GB recommended for 34B+ models (A100/H100)
Multi-GPU	Optional	Requires the NCCL backend for distributed runs
CUDA	Compatible with PyTorch 2.1+	CUDA 11.8+ typical
OS	Linux	Distributed training not supported on Windows/macOS
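
As a rough check on the VRAM figures in the table, weight memory can be approximated as parameter count times bytes per parameter. The sketch below assumes 16-bit weights (2 bytes each) and ignores activations, KV cache, and vision-encoder overhead, so it gives a lower bound rather than a sizing rule. The function name is illustrative and not part of lmms-eval.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in decimal GB.

    Assumption: 2 bytes per parameter (fp16/bf16). Activations, KV cache,
    and framework overhead are not included.
    """
    return num_params * bytes_per_param / 1e9


for label, params in [("7B", 7e9), ("34B", 34e9)]:
    print(f"{label}: ~{weight_memory_gb(params):.0f} GB for weights")
```

Under these assumptions a 7B model needs about 14 GB for weights and a 34B model about 68 GB, which is consistent with the 16GB minimum and the 80GB recommendation for 34B+ models.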

Dependencies

System Packages

NVIDIA GPU driver (compatible with CUDA toolkit)
CUDA toolkit (version matching PyTorch build)
NCCL (for multi-GPU distributed training)

Python Packages

torch >= 2.1.0 — Built with CUDA support
accelerate >= 0.29.1 — HuggingFace distributed backend
torch.distributed — PyTorch native distributed (torchrun backend)

Credentials

The following environment variables configure the GPU compute environment:

Device Configuration:

CUDA_VISIBLE_DEVICES: Controls which GPUs are visible to the process
LOCAL_RANK: Local process rank within node (default: 0)
RANK: Global process rank across all nodes (default: 0)
WORLD_SIZE: Total number of processes (default: 1)

Testing:

TEST_GPU_COUNT: Override GPU count for testing scenarios
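
The variables listed under Device Configuration can be collected in one place with their documented defaults. This is a minimal sketch using only the standard library; `DistEnv` and `read_dist_env` are hypothetical names, and the handling of `CUDA_VISIBLE_DEVICES` (unset means unrestricted, empty string means no GPUs) reflects CUDA's general behavior rather than anything specific to lmms-eval.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class DistEnv:
    local_rank: int
    rank: int
    world_size: int
    visible_devices: Optional[List[str]]  # None means "not restricted"


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read launcher-provided variables, falling back to single-process defaults."""
    raw = env.get("CUDA_VISIBLE_DEVICES")
    # Entries may be indices or GPU UUIDs, so they are kept as strings.
    visible = None if raw is None else [d.strip() for d in raw.split(",") if d.strip()]
    return DistEnv(
        local_rank=int(env.get("LOCAL_RANK", 0)),
        rank=int(env.get("RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
        visible_devices=visible,
    )


print(read_dist_env({"CUDA_VISIBLE_DEVICES": "0,1", "LOCAL_RANK": "1", "RANK": "1", "WORLD_SIZE": "2"}))
print(read_dist_env({}))
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to exercise without launching a distributed job.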

Quick Install

# Verify CUDA availability
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())"

# Single GPU evaluation
python -m lmms_eval --model qwen2_5_vl --tasks mme --device cuda:0

# Multi-GPU with accelerate
accelerate launch --num_processes=4 -m lmms_eval --model qwen2_5_vl --tasks mme

# Multi-GPU with torchrun
torchrun --nproc_per_node=4 -m lmms_eval --model qwen2_5_vl --tasks mme

Code Evidence

Distributed process initialization with a 60,000-second (about 16.7-hour) timeout, from lmms_eval/__main__.py:494-499:

if torch.distributed.is_available() and torch.distributed.is_initialized():
    accelerator = None
    is_main_process = torch.distributed.get_rank() == 0
else:
    kwargs_handler = InitProcessGroupKwargs(timeout=datetime.timedelta(seconds=60000))
    accelerator = Accelerator(kwargs_handlers=[kwargs_handler])
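
To put the timeout value in perspective, the sketch below converts it with the standard library only. It does not touch Accelerate; it simply shows what `datetime.timedelta(seconds=60000)` amounts to.

```python
import datetime

timeout = datetime.timedelta(seconds=60000)
hours = timeout.total_seconds() / 3600

print(timeout)           # 16:40:00
print(f"{hours:.2f} h")  # 16.67 h
```

A limit this long means a stalled rank can keep a job alive for many hours before the process group gives up, which is worth keeping in mind when diagnosing hangs.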

Distributed environment variable reading from lmms_eval/evaluator.py:275-277:

local_rank = int(os.environ.get("LOCAL_RANK", 0))
global_rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
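
One way these values can feed device placement is sketched below, following the rule stated under Compatibility Notes (one `cuda:{index}` device per rank in multi-GPU runs, `device_map="auto"` otherwise). `select_device` is a hypothetical helper, and the single-process device string `"cuda"` is an assumption; the framework's model classes may resolve devices differently.

```python
import os
from typing import Mapping, Tuple


def select_device(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (device, device_map) for this process.

    Hypothetical helper that mirrors the documented rule; it is not the
    framework's own function.
    """
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    if world_size > 1:
        # Each rank pins the whole model to its own GPU (data parallel).
        device = f"cuda:{local_rank}"
        return device, device
    # Single process: let the loader spread the model over visible GPUs.
    return "cuda", "auto"


print(select_device({"LOCAL_RANK": "2", "WORLD_SIZE": "4"}))
print(select_device({}))
```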

Attention implementation validation from lmms_eval/models/simple/qwen2_5_vl.py:64-66:

valid_attn_implementations = [None, "flash_attention_2", "sdpa", "eager"]
if attn_implementation not in valid_attn_implementations:
    raise ValueError(f"attn_implementation must be one of {valid_attn_implementations}")

Dynamic attention based on PyTorch version from lmms_eval/models/simple/llama_vid.py:49-51:

attn_implementation=("sdpa" if torch.__version__ > "2.1.2" else "eager")
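
The excerpt compares version strings directly, which Python evaluates character by character. That works for current 2.x releases but would misorder a hypothetical two-digit major version. The sketch below shows a numeric comparison using only the standard library; `release_tuple` and `pick_attn` are illustrative names, and the preference order (flash_attention_2, then sdpa, then eager) is an assumption drawn from the compatibility notes rather than the framework's own selection logic.

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. '2.1.2+cu118' -> (2, 1, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def pick_attn(torch_version: str, flash_attn_installed: bool = False) -> str:
    """Hypothetical selector: prefer flash_attention_2, then sdpa, then eager."""
    if flash_attn_installed:
        return "flash_attention_2"
    # Same threshold as the excerpt (strictly newer than 2.1.2), compared numerically.
    return "sdpa" if release_tuple(torch_version) > (2, 1, 2) else "eager"


print("10.0.0" > "2.1.2")                    # False: string comparison
print(release_tuple("10.0.0") > (2, 1, 2))   # True: numeric comparison
print(pick_attn("2.4.0+cu121"))              # sdpa
print(pick_attn("2.0.1"))                    # eager
```

The two approaches also differ on build suffixes: `"2.1.2+cu118" > "2.1.2"` is true as a string comparison, while both reduce to the same release tuple. Whether flash-attn is present could be probed with `importlib.util.find_spec("flash_attn")` and passed in as `flash_attn_installed`.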

GPU memory utilization defaults from lmms_eval/models/chat/vllm.py:30:

gpu_memory_utilization=0.8
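
As a worked example of what this default implies, the sketch below computes the per-GPU memory budget for a given utilization fraction. The helper name is hypothetical and the card sizes are examples only; the budget covers everything the backend allocates (weights, activations, KV cache), not the weights alone.

```python
def vram_budget_gb(total_gb: float, gpu_memory_utilization: float = 0.8) -> float:
    """Memory the backend may claim on one GPU for a given utilization fraction."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_gb * gpu_memory_utilization


for total in (24, 80):
    print(f"{total} GB card -> {vram_budget_gb(total):.1f} GB budget")
```

At the default of 0.8, an 80 GB card leaves a 64 GB budget, with the remaining 16 GB available to other processes or allocations outside the backend.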

CUDA device assertion from lmms_eval/models/simple/videochat_flash.py:57:

assert torch.cuda.device_count() > 0, torch.cuda.device_count()

Common Errors

Error Message	Cause	Solution
`CUDA out of memory`	Model too large for available VRAM	Reduce `max_pixels`, decrease `batch_size`, or use multi-GPU
`RuntimeError: NCCL error`	Distributed communication failure	Check NCCL installation and network configuration
`assert torch.cuda.device_count() > 0`	No CUDA GPU detected	Verify GPU driver and CUDA_VISIBLE_DEVICES
Distributed timeout	Process hanging in synchronization	Increase the timeout (default 60,000 s) or check load balance across ranks
`attn_implementation must be one of ...`	Invalid attention backend	Use one of: `None`, `flash_attention_2`, `sdpa`, `eager`
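
The "decrease `batch_size`" remedy for out-of-memory errors can be automated with a simple halving loop. The sketch below is a generic pattern, not something lmms-eval is stated to provide: `run_with_batch_backoff` and `fake_eval` are hypothetical, and OOM is detected by message text on the assumption that it surfaces as a `RuntimeError` containing "out of memory".

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_batch_backoff(run: Callable[[int], T], batch_size: int) -> Tuple[T, int]:
    """Call run(batch_size), halving batch_size after each out-of-memory error.

    Returns the result and the batch size that succeeded. Errors that are
    not OOM, or an OOM at batch size 1, are re-raised.
    """
    while True:
        try:
            return run(batch_size), batch_size
        except RuntimeError as exc:
            # Assumption: OOM is recognized by its message text.
            if "out of memory" not in str(exc).lower() or batch_size <= 1:
                raise
            batch_size //= 2


def fake_eval(batch_size: int) -> str:
    # Stand-in for a real evaluation step; pretends only batch_size <= 4 fits.
    if batch_size > 4:
        raise RuntimeError("CUDA out of memory. Tried to allocate ...")
    return f"ok with batch_size={batch_size}"


print(run_with_batch_backoff(fake_eval, 16))
```

In a real GPU run, cached allocations would typically be released between attempts (for example with `torch.cuda.empty_cache()`) before retrying at the smaller size.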

Compatibility Notes

Attention backends: flash_attention_2 requires a separate installation (pip install flash-attn). If it is not installed, fall back to sdpa (PyTorch newer than 2.1.2, per the llama_vid.py check above) or eager.
Distributed backends: accelerate is easier for single-node multi-GPU. torchrun is preferred for multi-node setups.
Memory management: vLLM and SGLang backends default to 80% GPU memory utilization. Adjustable via gpu_memory_utilization parameter.
Device mapping: Multi-GPU uses cuda:{local_process_index} per-rank. Single GPU uses device_map="auto".
Max pixels: Default 1,605,632 for vision models. Reduce for lower VRAM GPUs.
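
To see why lowering the pixel budget helps on smaller GPUs, the sketch below converts a `max_pixels` value into an upper bound on visual tokens per image. It assumes one visual token per 28x28 pixel area, as in Qwen2-VL-style processors; other model families may use a different ratio, and the function name is illustrative.

```python
def max_visual_tokens(max_pixels: int, pixels_per_token: int = 28 * 28) -> int:
    """Upper bound on visual tokens per image for a given pixel budget.

    Assumption: one visual token per 28x28 pixel area (Qwen2-VL-style).
    """
    return max_pixels // pixels_per_token


default_max_pixels = 1_605_632
print(max_visual_tokens(default_max_pixels))       # 2048
print(max_visual_tokens(default_max_pixels // 2))  # 1024
```

Under that assumption the default corresponds to 2048 visual tokens per image, and halving `max_pixels` halves the token count and the associated activation memory.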
