Environment:Volcengine Verl CUDA GPU Environment
Metadata
| Field | Value |
|---|---|
| Sources | [verl](https://github.com/volcengine/verl) |
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-07 17:00 GMT |
Overview
Linux environment with NVIDIA CUDA GPU or Huawei Ascend NPU for RL training of LLMs.
Description
verl supports NVIDIA CUDA GPUs and Huawei Ascend NPUs. It detects the device automatically via torch.cuda.is_available() and torch.npu.is_available(). On CUDA, the compute capability is read with torch.cuda.get_device_capability(). On Ascend NPU, IPC support requires Ascend Software >= 25.3.rc1 and CANN >= 8.3.rc1.
Usage
Required for all training workflows (GRPO, PPO, SFT, multi-turn). CPU fallback exists but is not practical for LLM training.
System Requirements
| Component | Requirement |
|---|---|
| OS | Linux (Ubuntu recommended) |
| Hardware | NVIDIA GPU (A100/H100 preferred, min 16GB VRAM) or Huawei Ascend NPU |
| Disk | 50GB+ SSD for model checkpoints |
Dependencies
- torch (with CUDA)
- torch_npu (for Ascend)
- packaging
Credentials
- CUDA_VISIBLE_DEVICES: Device selection for CUDA GPUs
- ASCEND_RT_VISIBLE_DEVICES: Ascend device selection
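As a rough illustration of how these two variables are read, the sketch below parses a comma-separated device list from the environment. The helper name `visible_devices` is hypothetical and is not part of verl. The sketch assumes the usual convention that an unset variable exposes every device and an empty value exposes none.

```python
import os
from typing import List, Mapping, Optional

# The two device-selection variables named in this section.
DEVICE_ENV_VARS = ("CUDA_VISIBLE_DEVICES", "ASCEND_RT_VISIBLE_DEVICES")


def visible_devices(var: str, env: Optional[Mapping[str, str]] = None) -> Optional[List[str]]:
    """Return the device IDs listed in `var`, or None if the variable is unset.

    Hypothetical helper for illustration only; IDs are kept as strings
    because CUDA also accepts UUID-style identifiers.
    """
    env = os.environ if env is None else env
    raw = env.get(var)
    if raw is None:
        return None  # unset: assumed to mean "all devices visible"
    # Split on commas and drop empty entries and surrounding whitespace.
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    for name in DEVICE_ENV_VARS:
        print(name, "->", visible_devices(name))
```

Either variable should be set before the process initializes the device runtime (in practice, before the training script starts), since changing it afterwards generally has no effect.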
Quick Install
pip install torch
Code Evidence
From verl/utils/device.py:
is_cuda_available = torch.cuda.is_available()
is_npu_available = is_torch_npu_available()
And the device name function:
def get_device_name() -> str:
    if is_cuda_available:
        device = "cuda"
    elif is_npu_available:
        device = "npu"
    else:
        device = "cpu"
    return device
And IPC version check from verl/utils/device.py:187-296:
if version.parse(software_base) >= version.parse("25.3.rc1"):
    if version.parse(cann_base) >= version.parse("8.3.rc1"):
        return True
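To show how the two thresholds combine, here is a minimal, dependency-free sketch of the same comparison. It is not verl's implementation: `version_key` is a hypothetical stand-in for `packaging.version.parse` that only handles strings shaped like `25.3`, `25.3.1`, or `25.3.rc1`, and `supports_ipc` is an illustrative name. How verl obtains `software_base` and `cann_base` is not covered here.

```python
import re

# Matches a dotted numeric release with an optional ".rcN" suffix, e.g. "8.3.rc1".
_VERSION_RE = re.compile(r"^(\d+(?:\.\d+)*)(?:\.?rc(\d+))?$", re.IGNORECASE)


def version_key(text: str) -> tuple:
    """Build a sortable key; a simplified stand-in for packaging.version.parse."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    release += (0,) * (3 - len(release))  # pad so "25.3" compares equal to "25.3.0"
    if match.group(2) is None:
        return (release, 1, 0)  # a final release sorts after any release candidate
    return (release, 0, int(match.group(2)))


def supports_ipc(software_version: str, cann_version: str) -> bool:
    """Both minimums from this section must hold (illustrative helper)."""
    return (
        version_key(software_version) >= version_key("25.3.rc1")
        and version_key(cann_version) >= version_key("8.3.rc1")
    )


if __name__ == "__main__":
    print(supports_ipc("25.3.rc1", "8.3.rc1"))
    print(supports_ipc("25.2.1", "8.3.rc1"))
```

The key orders release candidates before the matching final release, which is the ordering a PEP 440 parser would also apply to these strings.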
Common Errors
| Error | Solution |
|---|---|
| "CUDA not available" | Install CUDA toolkit + GPU driver |
| "IPC not supported on your devices" | Update Ascend Software >= 25.3.rc1 and CANN >= 8.3.rc1 |
| "No GPU/NPU detected" | Check device drivers |
Compatibility Notes
NVIDIA GPUs with compute capability >= 7.0 are recommended. Ascend NPUs require torch_npu. CPU mode is available but impractical for LLM training.
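The compute capability recommendation can be checked with a short script. The sketch below is an assumption-laden illustration, not code from verl: `meets_recommended_capability` and `current_capability` are hypothetical helper names, and the torch import is guarded so the script also runs on machines without PyTorch.

```python
from typing import Optional, Tuple

RECOMMENDED_MIN = (7, 0)  # threshold taken from the note above


def meets_recommended_capability(
    capability: Tuple[int, int], minimum: Tuple[int, int] = RECOMMENDED_MIN
) -> bool:
    """Compare (major, minor) tuples lexicographically."""
    return tuple(capability) >= tuple(minimum)


def current_capability() -> Optional[Tuple[int, int]]:
    """Return the current CUDA device's capability, or None if unavailable."""
    try:
        import torch  # optional: only needed for the live query
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    cap = current_capability()
    if cap is None:
        print("No CUDA device detected")
    else:
        print(cap, "meets recommendation:", meets_recommended_capability(cap))
```

Comparing tuples rather than a float such as 7.0 avoids misordering capabilities like (7, 5) and (7, 10).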