Environment: Alibaba ROLL CUDA GPU Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Deep_Learning, GPU_Computing |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
NVIDIA CUDA GPU environment with NCCL communication backend, PyTorch 2.6.0+ or 2.8.0, and CUDA 12.4+ for distributed LLM reinforcement learning training and inference.
Description
This environment provides the primary GPU-accelerated context for running the ROLL framework on NVIDIA hardware. It is built on NVIDIA CUDA with NCCL for distributed communication. The platform auto-detection in ROLL identifies NVIDIA GPUs via `torch.cuda.get_device_name()` and initializes the `CudaPlatform` singleton. BF16 mixed precision requires Ampere (A100) or newer GPUs. The framework sets numerous CUDA-specific environment variables including `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, disables `NCCL_NVLS_ENABLE`, and configures `TORCHINDUCTOR_COMPILE_THREADS=2`.
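The BF16 requirement can be verified at runtime from the GPU's compute capability: Ampere corresponds to compute capability 8.x, while V100 is 7.0. A minimal sketch (the `pick_dtype` helper is illustrative, not part of ROLL):

```python
def pick_dtype(capability: tuple[int, int]) -> str:
    """Ampere (compute capability 8.x, e.g. A100) and newer support BF16;
    older GPUs such as V100 (7.0) should fall back to FP16."""
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"

try:
    import torch
    if torch.cuda.is_available():
        # torch.cuda.get_device_capability() returns (major, minor)
        print(pick_dtype(torch.cuda.get_device_capability()))
except ImportError:
    pass  # torch not installed; the helper above is still usable standalone
```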
Usage
Use this environment for all ROLL training and inference pipelines on NVIDIA GPUs: RLVR, Agentic RL, DPO, SFT, Knowledge Distillation, and Reward Flow Diffusion. This is the primary and most fully-supported platform, with all backends available (vLLM, SGLang, Megatron, DeepSpeed, FSDP2).
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | Kernel 5.15+ recommended; Windows/macOS not supported |
| Hardware | NVIDIA GPU with CUDA support | A100/H100 preferred for BF16; V100 supported with FP16 |
| VRAM | Minimum 16GB per GPU | 40GB+ recommended for 7B+ models |
| CUDA | >= 12.4 | CUDA 12.9 for SGLang with cuda-bindings |
| cuDNN | >= 9.1.0 | Required for Flash Attention |
| Disk | 50GB+ SSD | For model checkpoints and datasets |
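The version minimums above can be checked programmatically. One sketch, assuming dotted numeric version strings (note that string comparison would get `"12.10" < "12.9"` wrong, so versions are compared as integer tuples):

```python
def version_tuple(v: str) -> tuple[int, ...]:
    """Parse a dotted version string like '12.4' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def meets_requirement(installed: str, minimum: str) -> bool:
    """True when the installed version satisfies the table's '>=' requirement."""
    return version_tuple(installed) >= version_tuple(minimum)

try:
    import torch
    if torch.version.cuda is not None:
        print("CUDA", torch.version.cuda, "meets 12.4:",
              meets_requirement(torch.version.cuda, "12.4"))
except ImportError:
    pass  # torch absent; requirements are then checked at install time
```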
Dependencies
System Packages
- `cuda-toolkit` >= 12.4
- `cudnn` >= 9.1.0
- `nccl` (bundled with PyTorch or installed separately)
Python Packages
- `torch` == 2.6.0 or 2.8.0 (pinned per Docker image)
- `torchvision` == 0.21.0 or 0.23.0
- `torchaudio` == 2.6.0 or 2.8.0
- `flash-attn` (required for torch 2.6.0 images)
- `transformer-engine[pytorch]` == 2.2.0 (torch 2.6.0 only)
- `ray[default,cgraph]` == 2.48.0
- `numpy` >= 1.25, < 2.0
- `peft` == 0.12.0
- `accelerate` == 0.34.2
- `trl` == 0.9.6
- `datasets` == 3.1.0
- `hydra-core`
- `omegaconf`
- `deepspeed` == 0.16.4
Credentials
No API credentials are required. The following environment variables may be needed depending on the workflow:
- `WORKER_NAME`: Worker identification in distributed setup (set internally by ROLL)
- `RANK`: Distributed training rank (set internally)
- `WORLD_SIZE`: Total number of processes (set internally)
- `LOCAL_RANK`: Local GPU rank (set internally)
- `CLUSTER_NAME`: Cluster identifier (set internally)
- `MODEL_DOWNLOAD_TYPE`: Set to `HUGGINGFACE_HUB` or `MODELSCOPE` for model download source
- `PROFILER_TIMELINE`: Set to `1` to enable timeline profiling
- `PROFILER_MEMORY`: Set to `1` to enable memory profiling
- `NCCL_CUMEM_ENABLE`: NCCL CUmem control (default `0`)
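The profiler flags follow a simple convention: a variable is enabled when set to the string `"1"`. A standalone sketch of that check (`profiling_enabled` is an illustrative helper based on the descriptions above, not a ROLL function):

```python
import os

def profiling_enabled(var: str) -> bool:
    """A flag such as PROFILER_TIMELINE or PROFILER_MEMORY counts as
    enabled only when explicitly set to '1'."""
    return os.environ.get(var, "0") == "1"

os.environ["PROFILER_TIMELINE"] = "1"
print(profiling_enabled("PROFILER_TIMELINE"))  # True
print(profiling_enabled("PROFILER_MEMORY"))    # False unless already exported
```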
Quick Install
```bash
# Using pre-built Docker image (recommended)
docker run -dit --gpus all --ipc=host --shm-size=10gb \
  roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-24.05-py3-torch260-vllm084

# Or install from requirements
pip install -r requirements_torch260_vllm.txt
# Alternative: pip install -r requirements_torch260_sglang.txt
# Alternative: pip install -r requirements_torch280_vllm.txt
```
Code Evidence
Platform detection from `roll/platforms/__init__.py:28-33`:
```python
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name().upper()
    if "NVIDIA" in device_name:
        return CudaPlatform()
```
CUDA environment configuration from `roll/platforms/cuda.py:32-43`:
```python
@classmethod
def get_custom_env_vars(cls) -> dict:
    env_vars = {
        "RAY_get_check_signal_interval_milliseconds": "1",
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "NCCL_CUMEM_ENABLE": os.getenv("NCCL_CUMEM_ENABLE", "0"),
        "NCCL_NVLS_ENABLE": "0",
    }
    return env_vars
```
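Note the pattern: `NCCL_CUMEM_ENABLE` defers to any value the user exported, while the other settings are fixed. A standalone sketch of the same merge behavior (not the actual `CudaPlatform` class):

```python
import os

def custom_env_vars() -> dict:
    """Mimics the CudaPlatform pattern: fixed settings plus one user override."""
    return {
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "NCCL_CUMEM_ENABLE": os.getenv("NCCL_CUMEM_ENABLE", "0"),
        "NCCL_NVLS_ENABLE": "0",
    }

os.environ["NCCL_CUMEM_ENABLE"] = "1"   # a user override is respected...
assert custom_env_vars()["NCCL_CUMEM_ENABLE"] == "1"
del os.environ["NCCL_CUMEM_ENABLE"]     # ...and the default applies otherwise
assert custom_env_vars()["NCCL_CUMEM_ENABLE"] == "0"
```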
vLLM disables expandable segments from `roll/third_party/vllm/__init__.py:53-56`:
```python
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = ""
torch.cuda.memory._set_allocator_settings("expandable_segments:False")
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: vLLM is not installed or not properly configured` | vLLM not installed or version mismatch | Install correct vLLM version: `pip install vllm==0.8.4` or `vllm==0.10.2` |
| `CUDA out of memory` | Insufficient GPU VRAM | Reduce batch size, enable CPU offloading, use FP8 quantization, or increase ZeRO level |
| `BackendCompilerFailed.__init__() missing 1 required positional argument` | Transformer Engine compile issue | Set `NVTE_TORCH_COMPILE: '0'` in system_envs config |
| `self.node2pg[node_rank] KeyError: 1` | device_mapping exceeds available GPUs | Ensure `max(device_mapping)` <= `total_gpu_nums` |
Compatibility Notes
- BF16 Precision: Requires Ampere (A100) or newer. V100 GPUs should use FP16 instead.
- FP8 Precision: Supported for dense models (per-tensor or per-block) and MoE models (per-block only).
- Docker: Pre-built images strongly recommended. Available at `roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch`.
- Python: Requires Python >= 3.10 (target version in pyproject.toml).