Environment:Alibaba ROLL CUDA GPU Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning, GPU_Computing
Last Updated: 2026-02-07 19:00 GMT

Overview

NVIDIA CUDA GPU environment with the NCCL communication backend, PyTorch 2.6.0 or 2.8.0, and CUDA 12.4+ for distributed reinforcement learning training and inference with LLMs.

Description

This environment is the primary GPU-accelerated context for running the ROLL framework on NVIDIA hardware, using NCCL for distributed communication. ROLL's platform auto-detection reads `torch.cuda.get_device_name()` and, when the name contains `NVIDIA`, initializes the `CudaPlatform` singleton. BF16 mixed precision requires Ampere-generation GPUs (for example A100) or newer. The framework sets several CUDA-specific environment variables, including `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, `NCCL_NVLS_ENABLE=0`, and `TORCHINDUCTOR_COMPILE_THREADS=2`.

Usage

Use this environment for all ROLL training and inference pipelines on NVIDIA GPUs: RLVR, Agentic RL, DPO, SFT, Knowledge Distillation, and Reward Flow Diffusion. It is the primary and most fully supported platform, with every backend available (vLLM, SGLang, Megatron, DeepSpeed, FSDP2).

System Requirements

Category | Requirement | Notes
OS | Linux (Ubuntu 20.04+) | Kernel 5.15+ recommended; Windows/macOS not supported
Hardware | NVIDIA GPU with CUDA support | A100/H100 preferred for BF16; V100 supported with FP16
VRAM | Minimum 16GB per GPU | 40GB+ recommended for 7B+ models
CUDA | >= 12.4 | CUDA 12.9 for SGLang with cuda-bindings
cuDNN | >= 9.1.0 | Required for Flash Attention
Disk | 50GB+ SSD | For model checkpoints and datasets
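
The version floors in the table can be checked before a job is launched. The following is a minimal sketch and not part of ROLL: `version_tuple` and `meets_minimum` are hypothetical helpers, and the version strings passed in are assumed to come from somewhere like `torch.version.cuda`.

```python
def version_tuple(version: str) -> tuple:
    """Turn '12.4' or '9.1.0' into a tuple of ints for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(found: str, minimum: str) -> bool:
    """Return True when `found` is at least `minimum`."""
    found_t, min_t = version_tuple(found), version_tuple(minimum)
    # Pad the shorter tuple with zeros so '12.4' and '12.4.0' compare equal.
    width = max(len(found_t), len(min_t))
    found_t += (0,) * (width - len(found_t))
    min_t += (0,) * (width - len(min_t))
    return found_t >= min_t
```

Comparing integer tuples rather than raw strings matters here: as strings, `"12.10"` sorts before `"12.4"`, which would wrongly reject a newer toolkit.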

Dependencies

System Packages

  • `cuda-toolkit` >= 12.4
  • `cudnn` >= 9.1.0
  • `nccl` (bundled with PyTorch or installed separately)

Python Packages

  • `torch` == 2.6.0 or 2.8.0 (pinned per Docker image)
  • `torchvision` == 0.21.0 or 0.23.0
  • `torchaudio` == 2.6.0 or 2.8.0
  • `flash-attn` (required for torch 2.6.0 images)
  • `transformer-engine[pytorch]` == 2.2.0 (torch 2.6.0 only)
  • `ray[default,cgraph]` == 2.48.0
  • `numpy` >= 1.25, < 2.0
  • `peft` == 0.12.0
  • `accelerate` == 0.34.2
  • `trl` == 0.9.6
  • `datasets` == 3.1.0
  • `hydra-core`
  • `omegaconf`
  • `deepspeed` == 0.16.4
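
Because most of these packages are pinned exactly, a quick comparison against the installed versions can catch a drifted environment early. This is a sketch under assumptions: `PINS`, `installed_version`, and `find_mismatches` are hypothetical names, only a subset of the pins is copied, and `torch` is left out because its wheels may report a local suffix such as `+cu124` that an exact string match would reject.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple

# A subset of the exact pins listed above (illustrative, not exhaustive).
PINS = {
    "ray": "2.48.0",
    "peft": "0.12.0",
    "accelerate": "0.34.2",
    "trl": "0.9.6",
    "datasets": "3.1.0",
    "deepspeed": "0.16.4",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each unsatisfied pin to (expected, found); found is None if missing."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Calling `find_mismatches(PINS)` inside the target environment returns an empty dict when every listed pin is satisfied. The `lookup` parameter exists only so the check can be exercised without installing anything.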

Credentials

The following environment variables may be needed depending on the workflow. Those marked "set internally" are populated by ROLL and do not need to be exported by hand:

  • `WORKER_NAME`: Worker identification in distributed setup (set internally by ROLL)
  • `RANK`: Distributed training rank (set internally)
  • `WORLD_SIZE`: Total number of processes (set internally)
  • `LOCAL_RANK`: Local GPU rank (set internally)
  • `CLUSTER_NAME`: Cluster identifier (set internally)
  • `MODEL_DOWNLOAD_TYPE`: Set to `HUGGINGFACE_HUB` or `MODELSCOPE` for model download source
  • `PROFILER_TIMELINE`: Set to `1` to enable timeline profiling
  • `PROFILER_MEMORY`: Set to `1` to enable memory profiling
  • `NCCL_CUMEM_ENABLE`: NCCL CUmem control (default `0`)
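
As an illustration of how a worker-side script might read these variables, here is a minimal sketch. `WorkerEnv` and `read_worker_env` are hypothetical and not part of ROLL, and the single-process fallbacks (`0`, `1`, `0`) are assumptions chosen for local debugging rather than documented defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkerEnv:
    rank: int
    world_size: int
    local_rank: int
    download_source: Optional[str]  # None when MODEL_DOWNLOAD_TYPE is unset
    profile_timeline: bool
    profile_memory: bool


def read_worker_env(env: Optional[Mapping[str, str]] = None) -> WorkerEnv:
    """Collect the variables listed above from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return WorkerEnv(
        # Fallbacks below are assumptions for a single-process run.
        rank=int(env.get("RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        download_source=env.get("MODEL_DOWNLOAD_TYPE"),
        # The profiler switches are enabled only by the literal value "1".
        profile_timeline=env.get("PROFILER_TIMELINE") == "1",
        profile_memory=env.get("PROFILER_MEMORY") == "1",
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to test and avoids mutating the process environment.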

Quick Install

# Using pre-built Docker image (recommended)
docker run -dit --gpus all --ipc=host --shm-size=10gb \
  roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-24.05-py3-torch260-vllm084

# Or install from requirements
pip install -r requirements_torch260_vllm.txt
# Alternative: pip install -r requirements_torch260_sglang.txt
# Alternative: pip install -r requirements_torch280_vllm.txt

Code Evidence

Platform detection from `roll/platforms/__init__.py:28-33`:

if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name().upper()
    if "NVIDIA" in device_name:
        return CudaPlatform()
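
The branch above can be restated on plain values so that it is testable on a machine without a GPU. This is a sketch only: `select_platform` is a hypothetical name, and the `"unknown"` fallback is a placeholder because the excerpt does not show what ROLL does when the NVIDIA check fails.

```python
def select_platform(cuda_available: bool, device_name: str) -> str:
    """Return "cuda" for an NVIDIA device, mirroring the excerpt's condition."""
    # The device name is upper-cased first, so the match is case-insensitive.
    if cuda_available and "NVIDIA" in device_name.upper():
        return "cuda"
    # Placeholder: the non-NVIDIA path is not shown in the excerpt.
    return "unknown"
```

On a GPU host the two arguments would come from `torch.cuda.is_available()` and, only when that returns true, `torch.cuda.get_device_name()`.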

CUDA environment configuration from `roll/platforms/cuda.py:32-43`:

@classmethod
def get_custom_env_vars(cls) -> dict:
    env_vars = {
        "RAY_get_check_signal_interval_milliseconds": "1",
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "NCCL_CUMEM_ENABLE": os.getenv("NCCL_CUMEM_ENABLE", "0"),
        "NCCL_NVLS_ENABLE": "0",
    }
    return env_vars
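
One detail of the excerpt is easy to miss: only `NCCL_CUMEM_ENABLE` defers to a value already present in the environment, while the other three entries are fixed. The standalone sketch below restates that behavior. `build_cuda_env` is a hypothetical function written for illustration, and how ROLL applies the returned dict to its workers is not shown in the excerpt.

```python
import os
from typing import Dict, Mapping, Optional


def build_cuda_env(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Rebuild the dict from the excerpt against an arbitrary environment."""
    environ = os.environ if environ is None else environ
    return {
        "RAY_get_check_signal_interval_milliseconds": "1",
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        # The only entry that respects a pre-existing value; defaults to "0".
        "NCCL_CUMEM_ENABLE": environ.get("NCCL_CUMEM_ENABLE", "0"),
        # Fixed: a value exported beforehand does not change this entry.
        "NCCL_NVLS_ENABLE": "0",
    }
```

In this sketch, exporting `NCCL_CUMEM_ENABLE=1` before launch changes the result, whereas exporting `NCCL_NVLS_ENABLE=1` does not.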

The vLLM integration turns expandable segments back off, from `roll/third_party/vllm/__init__.py:53-56`:

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = ""
torch.cuda.memory._set_allocator_settings("expandable_segments:False")

Common Errors

Error Message | Cause | Solution
`RuntimeError: vLLM is not installed or not properly configured` | vLLM not installed or version mismatch | Install a supported vLLM version: `pip install vllm==0.8.4` or `pip install vllm==0.10.2`
`CUDA out of memory` | Insufficient GPU VRAM | Reduce batch size, enable CPU offloading, use FP8 quantization, or increase ZeRO level
`BackendCompilerFailed.__init__() missing 1 required positional argument` | Transformer Engine compile issue | Set `NVTE_TORCH_COMPILE: '0'` in system_envs config
`self.node2pg[node_rank] KeyError: 1` | device_mapping exceeds available GPUs | GPU indices are zero-based, so ensure `max(device_mapping)` < `total_gpu_nums`
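
The `KeyError` in the last row can be caught before launch with a small validation step. This is a minimal sketch, assuming `device_mapping` is a list of zero-based global GPU indices; `validate_device_mapping` is a hypothetical helper and not a ROLL API.

```python
from typing import Sequence


def validate_device_mapping(device_mapping: Sequence[int], total_gpu_nums: int) -> None:
    """Raise ValueError if any index falls outside 0 .. total_gpu_nums - 1."""
    if not device_mapping:
        raise ValueError("device_mapping is empty")
    lowest, highest = min(device_mapping), max(device_mapping)
    # Assumption: indices are zero-based, so the largest valid one is total - 1.
    if lowest < 0 or highest >= total_gpu_nums:
        raise ValueError(
            f"device_mapping spans {lowest}..{highest}, "
            f"but only indices 0..{total_gpu_nums - 1} exist"
        )
```

For example, `validate_device_mapping(list(range(16)), 8)` raises, because indices 8 through 15 would refer to GPUs on a node that is not part of the cluster.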

Compatibility Notes

  • BF16 Precision: Requires Ampere (A100) or newer. V100 GPUs should use FP16 instead.
  • FP8 Precision: Supported for dense models (per-tensor or per-block) and MoE models (per-block only).
  • Docker: Pre-built images strongly recommended. Available at `roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch`.
  • Python: Requires Python >= 3.10 (target version in pyproject.toml).
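
The BF16 rule above can be expressed as a check on CUDA compute capability: Ampere devices report major version 8, while V100 (Volta) reports 7.0. The sketch below is illustrative only. `pick_mixed_precision` is a hypothetical helper, and the returned strings are placeholders rather than ROLL configuration values.

```python
from typing import Tuple


def pick_mixed_precision(compute_capability: Tuple[int, int]) -> str:
    """Choose "bf16" on Ampere or newer (major >= 8), otherwise "fp16"."""
    major, _minor = compute_capability
    return "bf16" if major >= 8 else "fp16"
```

On a GPU host the argument would come from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple for the current device.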
