Environment: OpenCompass VLMEvalKit GPU CUDA Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing, Distributed_Training |
| Last Updated | 2026-02-14 01:30 GMT |
Overview
NVIDIA GPU environment with CUDA, PyTorch, and the NCCL backend for distributed multi-GPU VLM inference via `torchrun`.
Description
This environment defines the GPU hardware and distributed-computing requirements for running local VLM models (as opposed to API-only models). VLMEvalKit uses `torchrun` to launch multiple model instances across the GPUs of a single node. The framework detects available GPUs via `CUDA_VISIBLE_DEVICES` or, failing that, `nvidia-smi`, partitions work across ranks, and synchronizes results through PyTorch distributed barriers with the NCCL backend. Each rank is assigned a partition of `NGPU // LOCAL_WORLD_SIZE` GPUs.
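To illustrate what "partitions work across ranks" means in practice, here is a minimal sketch of rank-based data sharding. This is a standalone illustration under the assumption of a strided split, not VLMEvalKit's actual sharding code; the function name `shard_indices` is hypothetical.

```python
# Minimal sketch (not VLMEvalKit's actual code): how torchrun-launched
# ranks can split a dataset so every item is processed by exactly one rank.
def shard_indices(num_items, rank, world_size):
    """Strided split: rank r takes items r, r+world_size, r+2*world_size, ..."""
    return list(range(rank, num_items, world_size))

# With 10 items and 4 ranks, the shards cover all items with no overlap.
shards = [shard_indices(10, r, 4) for r in range(4)]
```

After each rank finishes its shard, a distributed barrier lets rank 0 gather and merge the partial results.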
Usage
Use this environment for any workflow that involves local VLM inference (Image Benchmark Evaluation, Video Benchmark Evaluation with local models). API-only evaluations do not require this environment. The `scripts/run.sh` launcher script automatically detects GPUs and invokes `torchrun`.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU(s) | One or more GPUs; VRAM depends on model size |
| Driver | NVIDIA GPU Driver | Must support the CUDA version used by PyTorch |
| CUDA | Compatible with PyTorch build | No specific version pinned; follows PyTorch requirements |
| OS | Linux | `nvidia-smi` and `torchrun` required |
| Network | NCCL | Required for multi-GPU distributed communication |
Dependencies
System Packages
- `nvidia-smi` (GPU detection in `run.py:14`)
- NVIDIA GPU drivers
- CUDA toolkit (matching PyTorch build)
Python Packages
- `torch` (with CUDA support)
- `torch.distributed` (NCCL backend, used in `run.py:244-248`)
- `torchvision`
Environment Variables
The following environment variables configure distributed execution:
- `CUDA_VISIBLE_DEVICES`: Restricts which GPUs are used (`run.py:9`)
- `RANK`: Process rank in distributed group, default 0 (`run.py:21`)
- `WORLD_SIZE`: Total number of processes, default 1 (`run.py:22`)
- `LOCAL_WORLD_SIZE`: Processes per node, default 1 (`run.py:23`)
- `LOCAL_RANK`: Local process rank, default 0 (`run.py:24`)
- `DIST_TIMEOUT`: Distributed communication timeout in seconds, default 3600 (`run.py:247`)
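The variables above can be read with their documented defaults as follows. This is a standalone illustration of the pattern, not the literal `run.py` source:

```python
import datetime
import os

# Read the distributed-execution variables with the documented defaults.
RANK = int(os.environ.get('RANK', 0))
WORLD_SIZE = int(os.environ.get('WORLD_SIZE', 1))
LOCAL_WORLD_SIZE = int(os.environ.get('LOCAL_WORLD_SIZE', 1))

# DIST_TIMEOUT feeds the timeout passed to init_process_group (see Code Evidence).
timeout = datetime.timedelta(seconds=int(os.environ.get('DIST_TIMEOUT', 3600)))
```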
Quick Install
```bash
# Install PyTorch with CUDA support (example for CUDA 12.1)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Launch multi-GPU evaluation
bash scripts/run.sh --model MODEL_NAME --data DATASET_NAME

# Or manually with torchrun
GPU=$(nvidia-smi --list-gpus | wc -l)
torchrun --nproc-per-node=$GPU run.py --model MODEL_NAME --data DATASET_NAME
```
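Before launching, you can sanity-check which GPU ids the processes will see. The snippet below mirrors the `CUDA_VISIBLE_DEVICES` branch of the detection logic quoted in Code Evidence; it is a stdlib-only illustration, and `visible_gpu_ids` is a hypothetical helper, not part of `run.py`:

```python
import os

# Parse CUDA_VISIBLE_DEVICES the same way the framework's detection branch does:
# an empty or unset variable means "no explicit restriction".
def visible_gpu_ids():
    env = os.environ.get('CUDA_VISIBLE_DEVICES', '')
    return [int(x) for x in env.split(',')] if env else []

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # e.g. restrict the run to two GPUs
ids = visible_gpu_ids()
```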
Code Evidence
GPU detection without importing torch from `run.py:8-18`:
```python
def get_gpu_list():
    CUDA_VISIBLE_DEVICES = os.environ.get('CUDA_VISIBLE_DEVICES', '')
    if CUDA_VISIBLE_DEVICES != '':
        gpu_list = [int(x) for x in CUDA_VISIBLE_DEVICES.split(',')]
        return gpu_list
    try:
        ps = subprocess.Popen(('nvidia-smi', '--list-gpus'), stdout=subprocess.PIPE)
        output = subprocess.check_output(('wc', '-l'), stdin=ps.stdout)
        return list(range(int(output)))
    except:
        return []
```
GPU count assertion from `run.py:29`:
```python
assert NGPU >= LOCAL_WORLD_SIZE, "The number of processes should be less than or equal to the number of GPUs"
```
NCCL distributed initialization from `run.py:243-248`:
```python
if WORLD_SIZE > 1:
    import torch.distributed as dist
    dist.init_process_group(
        backend='nccl',
        timeout=datetime.timedelta(seconds=int(os.environ.get('DIST_TIMEOUT', 3600)))
    )
```
GPU per-process partitioning from `run.py:27-35`:
```python
if LOCAL_WORLD_SIZE > 1 and len(GPU_LIST):
    NGPU = len(GPU_LIST)
    GPU_PER_PROC = NGPU // LOCAL_WORLD_SIZE
    DEVICE_START_IDX = GPU_PER_PROC * LOCAL_RANK
    CUDA_VISIBLE_DEVICES = [str(i) for i in GPU_LIST[DEVICE_START_IDX: DEVICE_START_IDX + GPU_PER_PROC]]
    os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(CUDA_VISIBLE_DEVICES)
```
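A worked example of this partitioning, as a standalone reimplementation (not the `run.py` source): with 8 GPUs and 4 local ranks, each rank is assigned a contiguous slice of 2 GPUs.

```python
# 8 GPUs split evenly across 4 local ranks, following the arithmetic above.
GPU_LIST = list(range(8))
LOCAL_WORLD_SIZE = 4

def devices_for(local_rank):
    gpu_per_proc = len(GPU_LIST) // LOCAL_WORLD_SIZE  # 8 // 4 = 2 GPUs per rank
    start = gpu_per_proc * local_rank
    return ','.join(str(i) for i in GPU_LIST[start:start + gpu_per_proc])

partitions = [devices_for(r) for r in range(LOCAL_WORLD_SIZE)]
# partitions == ['0,1', '2,3', '4,5', '6,7']
```

Because the division is integer floor division, any remainder GPUs (e.g. 7 GPUs over 4 ranks) are simply left unused.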
Multi-GPU launcher script `scripts/run.sh:1-4`:
```bash
#!/bin/bash
set -x
export GPU=$(nvidia-smi --list-gpus | wc -l)
torchrun --nproc-per-node=$GPU run.py ${@:1}
```
CUDA memory cleanup after each inference from `vlmeval/inference.py:168`:
```python
torch.cuda.empty_cache()
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `The number of processes should be less than or equal to the number of GPUs` | More torchrun processes than available GPUs | Reduce `--nproc-per-node` to match GPU count |
| `RuntimeError: NCCL error` | NCCL communication failure | Check GPU driver, CUDA version, and network between GPUs |
| `CUDA out of memory` | Model too large for available VRAM | Use a smaller model, reduce batch size, or add more GPUs |
| `nvidia-smi: command not found` | NVIDIA drivers not installed | Install NVIDIA GPU drivers |
| Timeout during `dist.barrier()` | One process hung or crashed | Increase `DIST_TIMEOUT` env var (default 3600s) |
Compatibility Notes
- CPU-only systems: VLMEvalKit can run API-only evaluations without GPUs. PyTorch is optional at import time.
- SLURM clusters: Use `scripts/srun.sh` for SLURM-based multi-node launching.
- Multiple model instances: VLMEvalKit uses torchrun to run multiple independent model instances (one per GPU), not tensor parallelism. Each instance processes a data shard.