Environment: OpenCompass VLMEvalKit GPU CUDA Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing, Distributed_Training |
| Last Updated | 2026-02-14 01:30 GMT |
Overview
NVIDIA GPU environment with CUDA, PyTorch, and the NCCL backend for distributed multi-GPU VLM inference via `torchrun`.
Description
This environment defines the GPU hardware and distributed-computing requirements for running local VLM models (as opposed to API-only models). VLMEvalKit uses `torchrun` to launch multiple model instances across the GPUs of a single node. The framework detects available GPUs via `CUDA_VISIBLE_DEVICES` or, failing that, `nvidia-smi`, partitions work across ranks, and synchronizes results through PyTorch distributed barriers with the NCCL backend. Each rank is assigned a partition of `NGPU // LOCAL_WORLD_SIZE` GPUs.
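To illustrate what "partitions work across ranks" means in practice, here is a minimal sketch of rank-based data sharding. This is a standalone illustration under the assumption of a strided split, not VLMEvalKit's actual sharding code; the function name `shard_indices` is hypothetical.

```python
# Minimal sketch (not VLMEvalKit's actual code): how torchrun-launched
# ranks can split a dataset so every item is processed by exactly one rank.
def shard_indices(num_items, rank, world_size):
    """Strided split: rank r takes items r, r+world_size, r+2*world_size, ..."""
    return list(range(rank, num_items, world_size))

# With 10 items and 4 ranks, the shards cover all items with no overlap.
shards = [shard_indices(10, r, 4) for r in range(4)]
```

After each rank finishes its shard, a distributed barrier lets rank 0 gather and merge the partial results.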
Usage
Use this environment for any workflow that involves local VLM inference (Image Benchmark Evaluation, Video Benchmark Evaluation with local models). API-only evaluations do not require this environment. The `scripts/run.sh` launcher script automatically detects GPUs and invokes `torchrun`.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU(s) | One or more GPUs; VRAM depends on model size |
| Driver | NVIDIA GPU Driver | Must support the CUDA version used by PyTorch |
| CUDA | Compatible with PyTorch build | No specific version pinned; follows PyTorch requirements |
| OS | Linux | `nvidia-smi` and `torchrun` required |
| Network | NCCL | Required for multi-GPU distributed communication |
Dependencies
System Packages
- `nvidia-smi` (GPU detection in `run.py:14`)
- NVIDIA GPU drivers
- CUDA toolkit (matching PyTorch build)
Python Packages
- `torch` (with CUDA support)
- `torch.distributed` (NCCL backend, used in `run.py:244-248`)
- `torchvision`
Environment Variables
The following environment variables configure distributed execution:
- `CUDA_VISIBLE_DEVICES`: Restricts which GPUs are used (`run.py:9`)
- `RANK`: Process rank in distributed group, default 0 (`run.py:21`)
- `WORLD_SIZE`: Total number of processes, default 1 (`run.py:22`)
- `LOCAL_WORLD_SIZE`: Processes per node, default 1 (`run.py:23`)
- `LOCAL_RANK`: Local process rank, default 0 (`run.py:24`)
- `DIST_TIMEOUT`: Distributed communication timeout in seconds, default 3600 (`run.py:247`)
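The variables above can be read with their documented defaults as follows. This is a standalone illustration of the pattern, not the literal `run.py` source:

```python
import datetime
import os

# Read the distributed-execution variables with the documented defaults.
RANK = int(os.environ.get('RANK', 0))
WORLD_SIZE = int(os.environ.get('WORLD_SIZE', 1))
LOCAL_WORLD_SIZE = int(os.environ.get('LOCAL_WORLD_SIZE', 1))

# DIST_TIMEOUT feeds the timeout passed to init_process_group (see Code Evidence).
timeout = datetime.timedelta(seconds=int(os.environ.get('DIST_TIMEOUT', 3600)))
```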
Quick Install
```bash
# Install PyTorch with CUDA support (example for CUDA 12.1)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Launch multi-GPU evaluation
bash scripts/run.sh --model MODEL_NAME --data DATASET_NAME

# Or manually with torchrun
GPU=$(nvidia-smi --list-gpus | wc -l)
torchrun --nproc-per-node=$GPU run.py --model MODEL_NAME --data DATASET_NAME
```
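Before launching, you can sanity-check which GPU ids the processes will see. The snippet below mirrors the `CUDA_VISIBLE_DEVICES` branch of the detection logic quoted in Code Evidence; it is a stdlib-only illustration, and `visible_gpu_ids` is a hypothetical helper, not part of `run.py`:

```python
import os

# Parse CUDA_VISIBLE_DEVICES the same way the framework's detection branch does:
# an empty or unset variable means "no explicit restriction".
def visible_gpu_ids():
    env = os.environ.get('CUDA_VISIBLE_DEVICES', '')
    return [int(x) for x in env.split(',')] if env else []

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # e.g. restrict the run to two GPUs
ids = visible_gpu_ids()
```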
Code Evidence
GPU detection without importing torch from `run.py:8-18`:
```python
def get_gpu_list():
    CUDA_VISIBLE_DEVICES = os.environ.get('CUDA_VISIBLE_DEVICES', '')
    if CUDA_VISIBLE_DEVICES != '':
        gpu_list = [int(x) for x in CUDA_VISIBLE_DEVICES.split(',')]
        return gpu_list
    try:
        ps = subprocess.Popen(('nvidia-smi', '--list-gpus'), stdout=subprocess.PIPE)
        output = subprocess.check_output(('wc', '-l'), stdin=ps.stdout)
        return list(range(int(output)))
    except:
        return []
```
GPU count assertion from `run.py:29`:
```python
assert NGPU >= LOCAL_WORLD_SIZE, "The number of processes should be less than or equal to the number of GPUs"
```
NCCL distributed initialization from `run.py:243-248`:
```python
if WORLD_SIZE > 1:
    import torch.distributed as dist
    dist.init_process_group(
        backend='nccl',
        timeout=datetime.timedelta(seconds=int(os.environ.get('DIST_TIMEOUT', 3600)))
    )
```
GPU per-process partitioning from `run.py:27-35`:
```python
if LOCAL_WORLD_SIZE > 1 and len(GPU_LIST):
    NGPU = len(GPU_LIST)
    GPU_PER_PROC = NGPU // LOCAL_WORLD_SIZE
    DEVICE_START_IDX = GPU_PER_PROC * LOCAL_RANK
    CUDA_VISIBLE_DEVICES = [str(i) for i in GPU_LIST[DEVICE_START_IDX: DEVICE_START_IDX + GPU_PER_PROC]]
    os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(CUDA_VISIBLE_DEVICES)
```
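A worked example of this partitioning, as a standalone reimplementation (not the `run.py` source): with 8 GPUs and 4 local ranks, each rank is assigned a contiguous slice of 2 GPUs.

```python
# 8 GPUs split evenly across 4 local ranks, following the arithmetic above.
GPU_LIST = list(range(8))
LOCAL_WORLD_SIZE = 4

def devices_for(local_rank):
    gpu_per_proc = len(GPU_LIST) // LOCAL_WORLD_SIZE  # 8 // 4 = 2 GPUs per rank
    start = gpu_per_proc * local_rank
    return ','.join(str(i) for i in GPU_LIST[start:start + gpu_per_proc])

partitions = [devices_for(r) for r in range(LOCAL_WORLD_SIZE)]
# partitions == ['0,1', '2,3', '4,5', '6,7']
```

Because the division is integer floor division, any remainder GPUs (e.g. 7 GPUs over 4 ranks) are simply left unused.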
Multi-GPU launcher script `scripts/run.sh:1-4`:
```bash
#!/bin/bash
set -x
export GPU=$(nvidia-smi --list-gpus | wc -l)
torchrun --nproc-per-node=$GPU run.py ${@:1}
```
CUDA memory cleanup after each inference from `vlmeval/inference.py:168`:
```python
torch.cuda.empty_cache()
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `The number of processes should be less than or equal to the number of GPUs` | More torchrun processes than available GPUs | Reduce `--nproc-per-node` to match GPU count |
| `RuntimeError: NCCL error` | NCCL communication failure | Check GPU driver, CUDA version, and network between GPUs |
| `CUDA out of memory` | Model too large for available VRAM | Use a smaller model, reduce batch size, or add more GPUs |
| `nvidia-smi: command not found` | NVIDIA drivers not installed | Install NVIDIA GPU drivers |
| Timeout during `dist.barrier()` | One process hung or crashed | Increase `DIST_TIMEOUT` env var (default 3600s) |
Compatibility Notes
- CPU-only systems: VLMEvalKit can run API-only evaluations without GPUs. PyTorch is optional at import time.
- SLURM clusters: Use `scripts/srun.sh` for SLURM-based multi-node launching.
- Multiple model instances: VLMEvalKit uses torchrun to run multiple independent model instances (one per GPU), not tensor parallelism. Each instance processes a data shard.