Environment: Alibaba ROLL CUDA GPU Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Deep_Learning, GPU_Computing |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
NVIDIA CUDA GPU environment with NCCL communication backend, PyTorch 2.6.0+ or 2.8.0, and CUDA 12.4+ for distributed LLM reinforcement learning training and inference.
Description
This environment provides the primary GPU-accelerated context for running the ROLL framework on NVIDIA hardware. It is built on NVIDIA CUDA with NCCL for distributed communication. The platform auto-detection in ROLL identifies NVIDIA GPUs via `torch.cuda.get_device_name()` and initializes the `CudaPlatform` singleton. BF16 mixed precision requires Ampere (A100) or newer GPUs. The framework sets numerous CUDA-specific environment variables including `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, disables `NCCL_NVLS_ENABLE`, and configures `TORCHINDUCTOR_COMPILE_THREADS=2`.
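The BF16 requirement can be verified at runtime from the GPU's compute capability: Ampere corresponds to compute capability 8.x, while V100 is 7.0. A minimal sketch (the `pick_dtype` helper is illustrative, not part of ROLL):

```python
def pick_dtype(capability: tuple[int, int]) -> str:
    """Ampere (compute capability 8.x, e.g. A100) and newer support BF16;
    older GPUs such as V100 (7.0) should fall back to FP16."""
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"

try:
    import torch
    if torch.cuda.is_available():
        # torch.cuda.get_device_capability() returns (major, minor)
        print(pick_dtype(torch.cuda.get_device_capability()))
except ImportError:
    pass  # torch not installed; the helper above is still usable standalone
```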
Usage
Use this environment for all ROLL training and inference pipelines on NVIDIA GPUs: RLVR, Agentic RL, DPO, SFT, Knowledge Distillation, and Reward Flow Diffusion. This is the primary and most fully-supported platform, with all backends available (vLLM, SGLang, Megatron, DeepSpeed, FSDP2).
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | Kernel 5.15+ recommended; Windows/macOS not supported |
| Hardware | NVIDIA GPU with CUDA support | A100/H100 preferred for BF16; V100 supported with FP16 |
| VRAM | Minimum 16GB per GPU | 40GB+ recommended for 7B+ models |
| CUDA | >= 12.4 | CUDA 12.9 for SGLang with cuda-bindings |
| cuDNN | >= 9.1.0 | Required for Flash Attention |
| Disk | 50GB+ SSD | For model checkpoints and datasets |
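The version minimums above can be checked programmatically. One sketch, assuming dotted numeric version strings (note that string comparison would get `"12.10" < "12.9"` wrong, so versions are compared as integer tuples):

```python
def version_tuple(v: str) -> tuple[int, ...]:
    """Parse a dotted version string like '12.4' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def meets_requirement(installed: str, minimum: str) -> bool:
    """True when the installed version satisfies the table's '>=' requirement."""
    return version_tuple(installed) >= version_tuple(minimum)

try:
    import torch
    if torch.version.cuda is not None:
        print("CUDA", torch.version.cuda, "meets 12.4:",
              meets_requirement(torch.version.cuda, "12.4"))
except ImportError:
    pass  # torch absent; requirements are then checked at install time
```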
Dependencies
System Packages
- `cuda-toolkit` >= 12.4
- `cudnn` >= 9.1.0
- `nccl` (bundled with PyTorch or installed separately)
Python Packages
- `torch` == 2.6.0 or 2.8.0 (pinned per Docker image)
- `torchvision` == 0.21.0 or 0.23.0
- `torchaudio` == 2.6.0 or 2.8.0
- `flash-attn` (required for torch 2.6.0 images)
- `transformer-engine[pytorch]` == 2.2.0 (torch 2.6.0 only)
- `ray[default,cgraph]` == 2.48.0
- `numpy` >= 1.25, < 2.0
- `peft` == 0.12.0
- `accelerate` == 0.34.2
- `trl` == 0.9.6
- `datasets` == 3.1.0
- `hydra-core`
- `omegaconf`
- `deepspeed` == 0.16.4
Credentials
No API credentials are required. The following environment variables may be needed depending on the workflow:
- `WORKER_NAME`: Worker identification in distributed setup (set internally by ROLL)
- `RANK`: Distributed training rank (set internally)
- `WORLD_SIZE`: Total number of processes (set internally)
- `LOCAL_RANK`: Local GPU rank (set internally)
- `CLUSTER_NAME`: Cluster identifier (set internally)
- `MODEL_DOWNLOAD_TYPE`: Set to `HUGGINGFACE_HUB` or `MODELSCOPE` for model download source
- `PROFILER_TIMELINE`: Set to `1` to enable timeline profiling
- `PROFILER_MEMORY`: Set to `1` to enable memory profiling
- `NCCL_CUMEM_ENABLE`: NCCL CUmem control (default `0`)
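The profiler flags follow a simple convention: a variable is enabled when set to the string `"1"`. A standalone sketch of that check (`profiling_enabled` is an illustrative helper based on the descriptions above, not a ROLL function):

```python
import os

def profiling_enabled(var: str) -> bool:
    """A flag such as PROFILER_TIMELINE or PROFILER_MEMORY counts as
    enabled only when explicitly set to '1'."""
    return os.environ.get(var, "0") == "1"

os.environ["PROFILER_TIMELINE"] = "1"
print(profiling_enabled("PROFILER_TIMELINE"))  # True
print(profiling_enabled("PROFILER_MEMORY"))    # False unless already exported
```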
Quick Install
```bash
# Using pre-built Docker image (recommended)
docker run -dit --gpus all --ipc=host --shm-size=10gb \
  roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-24.05-py3-torch260-vllm084

# Or install from requirements
pip install -r requirements_torch260_vllm.txt
# Alternative: pip install -r requirements_torch260_sglang.txt
# Alternative: pip install -r requirements_torch280_vllm.txt
```
Code Evidence
Platform detection from `roll/platforms/__init__.py:28-33`:
```python
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name().upper()
    if "NVIDIA" in device_name:
        return CudaPlatform()
```
CUDA environment configuration from `roll/platforms/cuda.py:32-43`:
```python
@classmethod
def get_custom_env_vars(cls) -> dict:
    env_vars = {
        "RAY_get_check_signal_interval_milliseconds": "1",
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "NCCL_CUMEM_ENABLE": os.getenv("NCCL_CUMEM_ENABLE", "0"),
        "NCCL_NVLS_ENABLE": "0",
    }
    return env_vars
```
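Note the pattern: `NCCL_CUMEM_ENABLE` defers to any value the user exported, while the other settings are fixed. A standalone sketch of the same merge behavior (not the actual `CudaPlatform` class):

```python
import os

def custom_env_vars() -> dict:
    """Mimics the CudaPlatform pattern: fixed settings plus one user override."""
    return {
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "NCCL_CUMEM_ENABLE": os.getenv("NCCL_CUMEM_ENABLE", "0"),
        "NCCL_NVLS_ENABLE": "0",
    }

os.environ["NCCL_CUMEM_ENABLE"] = "1"   # a user override is respected...
assert custom_env_vars()["NCCL_CUMEM_ENABLE"] == "1"
del os.environ["NCCL_CUMEM_ENABLE"]     # ...and the default applies otherwise
assert custom_env_vars()["NCCL_CUMEM_ENABLE"] == "0"
```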
vLLM disables expandable segments from `roll/third_party/vllm/__init__.py:53-56`:
```python
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = ""
torch.cuda.memory._set_allocator_settings("expandable_segments:False")
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: vLLM is not installed or not properly configured` | vLLM not installed or version mismatch | Install correct vLLM version: `pip install vllm==0.8.4` or `vllm==0.10.2` |
| `CUDA out of memory` | Insufficient GPU VRAM | Reduce batch size, enable CPU offloading, use FP8 quantization, or increase ZeRO level |
| `BackendCompilerFailed.__init__() missing 1 required positional argument` | Transformer Engine compile issue | Set `NVTE_TORCH_COMPILE: '0'` in system_envs config |
| `self.node2pg[node_rank] KeyError: 1` | device_mapping exceeds available GPUs | Ensure `max(device_mapping)` <= `total_gpu_nums` |
Compatibility Notes
- BF16 Precision: Requires Ampere (A100) or newer. V100 GPUs should use FP16 instead.
- FP8 Precision: Supported for dense models (per-tensor or per-block) and MoE models (per-block only).
- Docker: Pre-built images strongly recommended. Available at `roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch`.
- Python: Requires Python >= 3.10 (target version in pyproject.toml).