Environment:EvolvingLMMs Lab Lmms eval GPU Compute Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing, Distributed_Training |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
NVIDIA GPU environment with CUDA support, distributed training via Accelerate or torchrun, and configurable device mapping for multi-GPU evaluation.
Description
This environment defines the GPU hardware and distributed computing requirements for running lmms-eval at production scale. It supports single-GPU, multi-GPU (data parallel), and multi-node distributed evaluation using either HuggingFace Accelerate or PyTorch torchrun backends. The framework relies on CUDA for GPU acceleration, uses SDPA (Scaled Dot-Product Attention) for memory-efficient inference on large models, and supports configurable attention implementations including flash_attention_2, sdpa, and eager modes. Default GPU memory utilization is set to 80% for vLLM/SGLang backends.
Usage
Use this environment for GPU-accelerated model evaluation, particularly when running large multimodal models (7B+ parameters). It is required for the Distributed Multi-GPU Evaluation workflow and strongly recommended for the End-to-End Evaluation workflow with models larger than 3B parameters.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| GPU | NVIDIA GPU with CUDA support | Compute capability 7.0+ recommended (Volta or newer) |
| VRAM | 16GB minimum | 80GB recommended for 34B+ models (A100/H100) |
| Multi-GPU | Optional | Requires NCCL backend for distributed |
| CUDA | Compatible with PyTorch 2.1+ | CUDA 11.8+ typical |
| OS | Linux | Distributed training not supported on Windows/macOS |
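As a rough illustration of how the VRAM figures in the table relate to model size, the sketch below estimates the memory taken by model weights alone. The function name `weights_gb` and the default of 2 bytes per parameter (fp16/bf16) are assumptions made for this example. Activations, KV cache, and image or video tokens come on top, so the result is only a lower bound.

```python
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in GB (10^9 bytes).

    Hypothetical helper: 1e9 parameters at N bytes each occupy N GB,
    so the estimate is simply billions-of-parameters times bytes-per-parameter.
    """
    return params_billion * bytes_per_param


for size in (3, 7, 34):
    print(f"{size}B parameters: about {weights_gb(size):.0f} GB of weights")
```

Under these assumptions a 7B model needs about 14 GB for weights, which sits just under the 16GB minimum, while a 34B model needs about 68 GB, which is why an 80GB card is recommended.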
Dependencies
System Packages
- NVIDIA GPU driver (compatible with CUDA toolkit)
- CUDA toolkit (version matching PyTorch build)
- NCCL (for multi-GPU distributed training)
Python Packages
- torch >= 2.1.0: built with CUDA support
- accelerate >= 0.29.1: HuggingFace distributed backend
- torch.distributed: PyTorch native distributed (torchrun backend)
Credentials
The following environment variables configure the GPU compute environment:
Device Configuration:
- CUDA_VISIBLE_DEVICES: Controls which GPUs are visible to the process
- LOCAL_RANK: Local process rank within node (default: 0)
- RANK: Global process rank across all nodes (default: 0)
- WORLD_SIZE: Total number of processes (default: 1)
Testing:
- TEST_GPU_COUNT: Override GPU count for testing scenarios
Quick Install
# Verify CUDA availability
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())"
# Single GPU evaluation
python -m lmms_eval --model qwen2_5_vl --tasks mme --device cuda:0
# Multi-GPU with accelerate
accelerate launch --num_processes=4 -m lmms_eval --model qwen2_5_vl --tasks mme
# Multi-GPU with torchrun
torchrun --nproc_per_node=4 -m lmms_eval --model qwen2_5_vl --tasks mme
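For a slightly more detailed check than the one-liner above, the following sketch lists each visible GPU with its VRAM and compute capability, which can be compared against the System Requirements table. The helper name `gpu_summary` is hypothetical and not part of lmms-eval. It returns an empty list when PyTorch is missing or no CUDA device is visible.

```python
def gpu_summary() -> list:
    """Return one dict per visible CUDA device (empty if none or no torch)."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    devices = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        devices.append(
            {
                "index": index,
                "name": props.name,
                # total_memory is reported in bytes
                "vram_gib": round(props.total_memory / 2**30, 1),
                "compute_capability": f"{props.major}.{props.minor}",
            }
        )
    return devices


if __name__ == "__main__":
    for device in gpu_summary() or [{"note": "no CUDA device visible"}]:
        print(device)
```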
Code Evidence
Distributed process initialization with a 60,000-second (roughly 16.7-hour) timeout, from lmms_eval/__main__.py:494-499:
if torch.distributed.is_available() and torch.distributed.is_initialized():
    accelerator = None
    is_main_process = torch.distributed.get_rank() == 0
else:
    kwargs_handler = InitProcessGroupKwargs(timeout=datetime.timedelta(seconds=60000))
    accelerator = Accelerator(kwargs_handlers=[kwargs_handler])
Distributed environment variable reading from lmms_eval/evaluator.py:275-277:
local_rank = int(os.environ.get("LOCAL_RANK", 0))
global_rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
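The same three variables can be read into a small structure so that the defaults are explicit and easy to test. This is a minimal sketch, not code from lmms-eval: the `DistEnv` class and its `from_environ` and `is_distributed` members are hypothetical names introduced here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    """Hypothetical container for the launcher-provided rank variables."""

    local_rank: int = 0
    global_rank: int = 0
    world_size: int = 1

    @classmethod
    def from_environ(cls, env: Mapping[str, str] = os.environ) -> "DistEnv":
        # Same defaults as the excerpt above: a plain `python` launch
        # behaves like a single process with rank 0.
        return cls(
            local_rank=int(env.get("LOCAL_RANK", 0)),
            global_rank=int(env.get("RANK", 0)),
            world_size=int(env.get("WORLD_SIZE", 1)),
        )

    @property
    def is_distributed(self) -> bool:
        return self.world_size > 1


# Values a launcher might export for the second of four processes.
print(DistEnv.from_environ({"LOCAL_RANK": "1", "RANK": "1", "WORLD_SIZE": "4"}))
# No variables set: falls back to the single-process defaults.
print(DistEnv.from_environ({}))
```

Passing the mapping in as an argument keeps the function testable without modifying the real process environment.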
Attention implementation validation from lmms_eval/models/simple/qwen2_5_vl.py:64-66:
valid_attn_implementations = [None, "flash_attention_2", "sdpa", "eager"]
if attn_implementation not in valid_attn_implementations:
    raise ValueError(f"attn_implementation must be one of {valid_attn_implementations}")
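To show how this validation might be combined with a fallback when `flash-attn` is not installed, here is a hedged sketch. The function `resolve_attn_implementation`, its `flash_available` parameter, and the policy of falling back to `sdpa` are assumptions for illustration. Only the list of valid values comes from the excerpt above.

```python
import importlib.util
from typing import Optional

VALID_ATTN_IMPLEMENTATIONS = [None, "flash_attention_2", "sdpa", "eager"]


def resolve_attn_implementation(
    requested: Optional[str],
    flash_available: Optional[bool] = None,
) -> Optional[str]:
    """Validate the requested backend and fall back if flash-attn is missing.

    `flash_available` can be passed explicitly (useful in tests); when left
    as None, the presence of the `flash_attn` package is probed.
    """
    if requested not in VALID_ATTN_IMPLEMENTATIONS:
        raise ValueError(
            f"attn_implementation must be one of {VALID_ATTN_IMPLEMENTATIONS}"
        )
    if requested == "flash_attention_2":
        if flash_available is None:
            flash_available = importlib.util.find_spec("flash_attn") is not None
        if not flash_available:
            # Assumed policy: degrade to PyTorch's built-in SDPA kernel.
            return "sdpa"
    return requested


print(resolve_attn_implementation("flash_attention_2", flash_available=False))
print(resolve_attn_implementation("eager"))
```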
Dynamic attention based on PyTorch version from lmms_eval/models/simple/llama_vid.py:49-51:
attn_implementation=("sdpa" if torch.__version__ > "2.1.2" else "eager")
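Note that comparing version strings with `>` is lexicographic, so a hypothetical "10.0.0" would sort below "2.1.2". A sketch of a numeric comparison follows. The helpers `parse_release` and `default_attn` are illustrative names, and the inclusive `>= (2, 1, 2)` threshold is an assumption taken from the "PyTorch 2.1.2+" wording in the Compatibility Notes rather than from the excerpt itself.

```python
import re


def parse_release(version: str) -> tuple:
    """Extract (major, minor, patch) from strings like '2.1.2+cu118'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def default_attn(version: str) -> str:
    """Pick 'sdpa' for releases at or above 2.1.2, else 'eager' (assumed threshold)."""
    return "sdpa" if parse_release(version) >= (2, 1, 2) else "eager"


# String comparison misorders multi-digit components; tuple comparison does not.
print("10.0.0" > "2.1.2", parse_release("10.0.0") > parse_release("2.1.2"))
print(default_attn("2.0.1"), default_attn("2.1.2+cu121"))
```

In a real environment the argument would be `torch.__version__`. The local build suffix (for example `+cu121`) is ignored by this parser.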
GPU memory utilization defaults from lmms_eval/models/chat/vllm.py:30:
gpu_memory_utilization=0.8
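To make the 0.8 default concrete, the sketch below computes the memory budget the inference engine would be allowed to claim on cards of different sizes. The helper name `memory_budget_gib` is hypothetical, and the card sizes are examples only.

```python
def memory_budget_gib(total_vram_gib: float, gpu_memory_utilization: float = 0.8) -> float:
    """Memory (GiB) available to the engine for a given utilization fraction."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_vram_gib * gpu_memory_utilization


for vram in (16, 24, 80):
    print(f"{vram} GiB card: about {memory_budget_gib(vram):.1f} GiB budget")
```

Weights, activations, and KV cache all have to fit inside this budget, so raising the fraction leaves less headroom for anything else running on the same GPU.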
CUDA device assertion from lmms_eval/models/simple/videochat_flash.py:57:
assert torch.cuda.device_count() > 0, torch.cuda.device_count()
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| CUDA out of memory | Model too large for available VRAM | Reduce max_pixels, decrease batch_size, or use multi-GPU |
| RuntimeError: NCCL error | Distributed communication failure | Check NCCL installation and network configuration |
| assert torch.cuda.device_count() > 0 | No CUDA GPU detected | Verify GPU driver and CUDA_VISIBLE_DEVICES |
| Distributed timeout | Process hanging in synchronization | Increase timeout (default 60000s) or check load balance |
| attn_implementation must be one of ... | Invalid attention backend | Use one of: None, flash_attention_2, sdpa, eager |
Compatibility Notes
- Attention backends: flash_attention_2 requires separate installation (pip install flash-attn). Falls back to sdpa (PyTorch 2.1.2+) or eager.
- Distributed backends: accelerate is easier for single-node multi-GPU. torchrun is preferred for multi-node setups.
- Memory management: vLLM and SGLang backends default to 80% GPU memory utilization. Adjustable via the gpu_memory_utilization parameter.
- Device mapping: Multi-GPU uses cuda:{local_process_index} per-rank. Single GPU uses device_map="auto".
- Max pixels: Default 1,605,632 for vision models. Reduce for lower VRAM GPUs.
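The device-mapping rule in the notes can be sketched as a small function. This is an illustration under stated assumptions, not the framework's code: `device_map_for` and its parameter names are hypothetical, and only the two outcomes (`cuda:{local_process_index}` per rank, `"auto"` for a single process) come from the note.

```python
def device_map_for(num_processes: int, local_process_index: int) -> str:
    """Return the device_map value for one process.

    With several processes, each rank is pinned to its own GPU;
    a single process lets the loader place the model with "auto".
    """
    if num_processes < 1 or not 0 <= local_process_index < num_processes:
        raise ValueError("local_process_index must be in [0, num_processes)")
    if num_processes > 1:
        return f"cuda:{local_process_index}"
    return "auto"


print([device_map_for(4, rank) for rank in range(4)])
print(device_map_for(1, 0))
```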