Environment:EvolvingLMMs Lab Lmms eval GPU Compute Environment

From Leeroopedia
Knowledge Sources
Domains: Infrastructure, GPU_Computing, Distributed_Training
Last Updated: 2026-02-14 00:00 GMT

Overview

NVIDIA GPU environment with CUDA support, distributed training via Accelerate or torchrun, and configurable device mapping for multi-GPU evaluation.

Description

This environment defines the GPU hardware and distributed computing requirements for running lmms-eval at production scale. It supports single-GPU, multi-GPU (data parallel), and multi-node distributed evaluation using either HuggingFace Accelerate or PyTorch torchrun backends. The framework relies on CUDA for GPU acceleration, uses SDPA (Scaled Dot-Product Attention) for memory-efficient inference on large models, and supports configurable attention implementations including flash_attention_2, sdpa, and eager modes. Default GPU memory utilization is set to 80% for vLLM/SGLang backends.

Usage

Use this environment for GPU-accelerated model evaluation, particularly when running large multimodal models (7B+ parameters). It is a mandatory prerequisite for the Distributed Multi-GPU Evaluation workflow and is strongly recommended for the End-to-End Evaluation workflow with models larger than 3B parameters.

System Requirements

Category | Requirement | Notes
GPU | NVIDIA GPU with CUDA support | Compute capability 7.0+ recommended (Volta or newer)
VRAM | 16GB minimum | 80GB recommended for 34B+ models (A100/H100)
Multi-GPU | Optional | Requires NCCL backend for distributed
CUDA | Compatible with PyTorch 2.1+ | CUDA 11.8+ typical
OS | Linux | Distributed training not supported on Windows/macOS
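
The VRAM figures above can be sanity-checked with rough arithmetic. The sketch below estimates only the memory taken by model weights, assuming 2 bytes per parameter (fp16/bf16). Activations, KV cache, and image or video buffers come on top, so the result is a lower bound. The function name weight_memory_gib is hypothetical and not part of lmms-eval.

```python
def weight_memory_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Lower-bound estimate of the memory needed for model weights, in GiB.

    Assumption: every parameter is stored in `bytes_per_param` bytes
    (2 for fp16/bf16). Runtime overhead is deliberately ignored.
    """
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / 2**30


for size in (3, 7, 34):
    print(f"{size}B params: ~{weight_memory_gib(size):.1f} GiB of weights")
```

Under these assumptions a 7B model needs roughly 13 GiB for weights alone, which leaves little headroom on a 16GB card, and a 34B model needs roughly 63 GiB, which is consistent with the 80GB recommendation.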

Dependencies

System Packages

  • NVIDIA GPU driver (compatible with CUDA toolkit)
  • CUDA toolkit (version matching PyTorch build)
  • NCCL (for multi-GPU distributed training)

Python Packages

  • torch >= 2.1.0: Built with CUDA support
  • accelerate >= 0.29.1: HuggingFace distributed backend
  • torch.distributed: PyTorch native distributed package, shipped with torch (torchrun backend)

Credentials

The following environment variables configure the GPU compute environment:

Device Configuration:

  • CUDA_VISIBLE_DEVICES: Controls which GPUs are visible to the process
  • LOCAL_RANK: Local process rank within node (default: 0)
  • RANK: Global process rank across all nodes (default: 0)
  • WORLD_SIZE: Total number of processes (default: 1)

Testing:

  • TEST_GPU_COUNT: Override GPU count for testing scenarios

Quick Install

# Verify CUDA availability
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())"

# Single GPU evaluation
python -m lmms_eval --model qwen2_5_vl --tasks mme --device cuda:0

# Multi-GPU with accelerate
accelerate launch --num_processes=4 -m lmms_eval --model qwen2_5_vl --tasks mme

# Multi-GPU with torchrun
torchrun --nproc_per_node=4 -m lmms_eval --model qwen2_5_vl --tasks mme
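
Before launching a long evaluation, it can help to check each visible GPU against the recommended compute capability. The following is a minimal preflight sketch, not part of lmms-eval: MIN_CAPABILITY, meets_min_capability, and cuda_summary are hypothetical names, and the 7.0 threshold is taken from the System Requirements notes. It degrades gracefully when torch is not installed.

```python
from typing import Tuple

# Assumption: mirrors the "compute capability 7.0+ recommended" note.
MIN_CAPABILITY: Tuple[int, int] = (7, 0)


def meets_min_capability(
    capability: Tuple[int, int],
    minimum: Tuple[int, int] = MIN_CAPABILITY,
) -> bool:
    """Compare (major, minor) compute capability tuples."""
    return tuple(capability) >= tuple(minimum)


def cuda_summary() -> dict:
    """Collect basic facts about the visible CUDA devices."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    devices = []
    if torch.cuda.is_available():
        for index in range(torch.cuda.device_count()):
            capability = torch.cuda.get_device_capability(index)
            devices.append(
                {
                    "index": index,
                    "name": torch.cuda.get_device_name(index),
                    "capability": capability,
                    "recommended": meets_min_capability(capability),
                }
            )
    return {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        "devices": devices,
    }


if __name__ == "__main__":
    print(cuda_summary())
```

Only the devices exposed through CUDA_VISIBLE_DEVICES are reported, so the same script can confirm that a device mask was applied as intended.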

Code Evidence

Distributed process initialization with 60,000-second timeout from lmms_eval/__main__.py:494-499:

if torch.distributed.is_available() and torch.distributed.is_initialized():
    accelerator = None
    is_main_process = torch.distributed.get_rank() == 0
else:
    kwargs_handler = InitProcessGroupKwargs(timeout=datetime.timedelta(seconds=60000))
    accelerator = Accelerator(kwargs_handlers=[kwargs_handler])
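
To put the 60,000-second value in perspective, the short snippet below converts it to hours and minutes. It uses only the standard library and makes no assumptions about Accelerate itself.

```python
import datetime

# Same timedelta as in the snippet above.
timeout = datetime.timedelta(seconds=60000)

hours, remainder = divmod(int(timeout.total_seconds()), 3600)
minutes = remainder // 60
print(f"Process-group timeout: {hours} h {minutes} min")  # 16 h 40 min
```

A timeout this long tolerates slow ranks on large video or long-context tasks, but it also means a genuinely stalled rank may block the job for many hours before an error surfaces.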

Distributed environment variable reading from lmms_eval/evaluator.py:275-277:

local_rank = int(os.environ.get("LOCAL_RANK", 0))
global_rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
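
The same three variables can be wrapped in a small helper so that downstream code does not repeat the defaults. This is an illustrative sketch only: RankInfo and read_rank_info are hypothetical names, not lmms-eval APIs. The defaults match the ones shown above, so a plain python launch behaves as a single-process run.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RankInfo:
    local_rank: int   # rank within the current node
    global_rank: int  # rank across all nodes
    world_size: int   # total number of processes

    @property
    def is_distributed(self) -> bool:
        return self.world_size > 1

    @property
    def is_main_process(self) -> bool:
        return self.global_rank == 0


def read_rank_info(env: Mapping[str, str] = os.environ) -> RankInfo:
    """Read launcher-provided variables, falling back to single-process defaults."""
    return RankInfo(
        local_rank=int(env.get("LOCAL_RANK", 0)),
        global_rank=int(env.get("RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


# No launcher: behaves like a single process.
print(read_rank_info({}))
# Example values as a launcher might set them on the second of two 8-GPU nodes.
print(read_rank_info({"LOCAL_RANK": "3", "RANK": "11", "WORLD_SIZE": "16"}))
```

Passing the environment as an argument keeps the helper easy to test without modifying os.environ.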

Attention implementation validation from lmms_eval/models/simple/qwen2_5_vl.py:64-66:

valid_attn_implementations = [None, "flash_attention_2", "sdpa", "eager"]
if attn_implementation not in valid_attn_implementations:
    raise ValueError(f"attn_implementation must be one of {valid_attn_implementations}")
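
A caller may want to validate the requested backend and also avoid requesting flash_attention_2 when the flash-attn package is missing. The sketch below shows one way to do that. resolve_attn_implementation and flash_attn_installed are hypothetical helpers, and silently degrading to sdpa is an assumption of this example, not documented lmms-eval behavior.

```python
import importlib.util
from typing import Optional

VALID_ATTN_IMPLEMENTATIONS = [None, "flash_attention_2", "sdpa", "eager"]


def flash_attn_installed() -> bool:
    """True when the flash_attn module can be found on the import path."""
    return importlib.util.find_spec("flash_attn") is not None


def resolve_attn_implementation(
    requested: Optional[str],
    flash_available: Optional[bool] = None,
) -> Optional[str]:
    """Validate the requested backend and degrade when flash-attn is absent."""
    if requested not in VALID_ATTN_IMPLEMENTATIONS:
        raise ValueError(
            f"attn_implementation must be one of {VALID_ATTN_IMPLEMENTATIONS}"
        )
    if flash_available is None:
        flash_available = flash_attn_installed()
    if requested == "flash_attention_2" and not flash_available:
        # Assumption of this sketch: fall back instead of failing at model load.
        return "sdpa"
    return requested


print(resolve_attn_implementation("flash_attention_2", flash_available=False))
```

The flash_available parameter exists so the decision can be exercised without installing or uninstalling the package.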

Dynamic attention based on PyTorch version from lmms_eval/models/simple/llama_vid.py:49-51:

attn_implementation=("sdpa" if torch.__version__ > "2.1.2" else "eager")
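
The check above relies on how torch.__version__ compares with a string. Recent PyTorch builds expose it as a version-aware string subclass (TorchVersion), so the comparison orders releases properly there. When a version arrives as a plain str (for example from a config file or a log), lexicographic comparison can misorder releases. The sketch below compares numeric release tuples instead. release_tuple and pick_attention are hypothetical helpers, and the parser deliberately ignores pre-release and local suffixes such as +cu118, which differs from full PEP 440 ordering as implemented by packaging.version.parse.

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. '2.1.2+cu118' -> (2, 1, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def pick_attention(version: str) -> str:
    """Choose sdpa for releases newer than 2.1.2, otherwise eager."""
    return "sdpa" if release_tuple(version) > (2, 1, 2) else "eager"


# Columns: version, plain string comparison, release-tuple comparison.
for candidate in ("2.1.2+cu118", "2.1.10", "2.2.0", "10.0.0"):
    print(candidate, candidate > "2.1.2", release_tuple(candidate) > (2, 1, 2))
```

The two comparisons disagree for every candidate except 2.2.0, which shows why plain string comparison is fragile for version checks.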

GPU memory utilization defaults from lmms_eval/models/chat/vllm.py:30:

gpu_memory_utilization=0.8
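
As a worked example of what the 0.8 default means in practice, the sketch below computes the memory budget the inference engine may claim on cards of different sizes. It treats gpu_memory_utilization as a simple fraction of total device memory. usable_memory_gib is a hypothetical helper for illustration only.

```python
def usable_memory_gib(total_gib: float, gpu_memory_utilization: float = 0.8) -> float:
    """Memory budget available to the engine, as a fraction of total device memory."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_gib * gpu_memory_utilization


for total in (16, 24, 80):
    print(f"{total} GiB card -> {usable_memory_gib(total):.1f} GiB budget")
```

On an 80 GiB card the default leaves about 16 GiB outside the engine's budget, which is useful headroom when other processes share the GPU.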

CUDA device assertion from lmms_eval/models/simple/videochat_flash.py:57:

assert torch.cuda.device_count() > 0, torch.cuda.device_count()

Common Errors

Error Message | Cause | Solution
CUDA out of memory | Model too large for available VRAM | Reduce max_pixels, decrease batch_size, or use multi-GPU
RuntimeError: NCCL error | Distributed communication failure | Check NCCL installation and network configuration
assert torch.cuda.device_count() > 0 | No CUDA GPU detected | Verify GPU driver and CUDA_VISIBLE_DEVICES
Distributed timeout | Process hanging in synchronization | Increase timeout (default 60000s) or check load balance
attn_implementation must be one of ... | Invalid attention backend | Use one of: None, flash_attention_2, sdpa, eager

Compatibility Notes

  • Attention backends: flash_attention_2 requires separate installation (pip install flash-attn). Falls back to sdpa (PyTorch 2.1.2+) or eager.
  • Distributed backends: accelerate is easier for single-node multi-GPU. torchrun is preferred for multi-node setups.
  • Memory management: vLLM and SGLang backends default to 80% GPU memory utilization. Adjustable via gpu_memory_utilization parameter.
  • Device mapping: Multi-GPU uses cuda:{local_process_index} per-rank. Single GPU uses device_map="auto".
  • Max pixels: Default 1,605,632 for vision models. Reduce for lower VRAM GPUs.
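
The device-mapping note above can be expressed as a small decision function. This is a sketch of the convention as described here, not code copied from lmms-eval: select_device is a hypothetical name, and the single-process defaults ("cuda" and "auto") are assumptions of the example.

```python
from typing import Tuple


def select_device(
    num_processes: int,
    local_process_index: int,
    device: str = "cuda",
    device_map: str = "auto",
) -> Tuple[str, str]:
    """Return (device, device_map) for the current process.

    Multi-process runs pin each rank to its own GPU. A single process
    keeps the caller's device and lets device_map place the model.
    """
    if num_processes > 1:
        per_rank = f"cuda:{local_process_index}"
        return per_rank, per_rank
    return device, device_map


print(select_device(num_processes=4, local_process_index=2))
print(select_device(num_processes=1, local_process_index=0))
```

Pinning one rank per GPU keeps data-parallel replicas from competing for the same device, while device_map="auto" in the single-process case allows a model to be spread across several GPUs.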
