Environment:Volcengine Verl CUDA GPU Environment

From Leeroopedia


Metadata

Field | Value
Sources | verl (https://github.com/volcengine/verl)
Domains | Infrastructure, Deep_Learning
Last Updated | 2026-02-07 17:00 GMT

Overview

A Linux environment with an NVIDIA CUDA GPU or a Huawei Ascend NPU for reinforcement learning (RL) training of large language models (LLMs).

Description

verl supports NVIDIA CUDA GPUs and Huawei Ascend NPUs. Device detection is automatic: verl checks torch.cuda.is_available() for CUDA and torch.npu.is_available() for Ascend. On CUDA, the compute capability is read with torch.cuda.get_device_capability(). On Ascend NPUs, IPC support requires Ascend software version >= 25.3.rc1 and CANN >= 8.3.rc1.

Usage

This environment is required for all training workflows (GRPO, PPO, SFT, multi-turn). A CPU fallback exists but is not practical for LLM training.

System Requirements

Component | Requirement
OS | Linux (Ubuntu recommended)
Hardware | NVIDIA GPU (A100/H100 preferred, min 16GB VRAM) or Huawei Ascend NPU
Disk | 50GB+ SSD for model checkpoints
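
The disk requirement above can be checked before a run with a small preflight sketch. This is an illustration only, not part of verl: the helper names are hypothetical, and "50GB" is read as decimal gigabytes, which is an assumption.

```python
import shutil

# Assumption: "50GB+" is interpreted as 50 * 10**9 bytes (decimal gigabytes).
REQUIRED_DISK_BYTES = 50 * 10**9


def free_bytes_at(path: str = ".") -> int:
    """Return the free space, in bytes, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def has_enough_disk(free_bytes: int, required_bytes: int = REQUIRED_DISK_BYTES) -> bool:
    """Return True when the free space meets the checkpoint requirement."""
    return free_bytes >= required_bytes


if __name__ == "__main__":
    free = free_bytes_at(".")
    status = "ok" if has_enough_disk(free) else "below the 50GB recommendation"
    print(f"free space: {free / 10**9:.1f} GB ({status})")
```

Point `free_bytes_at` at the directory where checkpoints will be written, since that filesystem may differ from the working directory.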

Dependencies

  • torch (with CUDA)
  • torch_npu (for Ascend)
  • packaging

Credentials

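  • CUDA_VISIBLE_DEVICES: Environment variable that selects which CUDA GPUs are visible to the process
  • ASCEND_RT_VISIBLE_DEVICES: Environment variable that selects which Ascend NPUs are visible to the process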
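
As a rough illustration of how these variables can be inspected before launching a job, the sketch below parses a comma-separated device list from an environment mapping. The helper name visible_devices is hypothetical and is not part of verl. The treatment of an unset variable follows CUDA's behaviour, and applying the same reading to Ascend is an assumption.

```python
import os
from typing import List, Mapping, Optional


def visible_devices(env: Mapping[str, str], variable: str) -> Optional[List[str]]:
    """Return the device entries listed in `variable`, or None if it is unset.

    Entries are kept as strings because CUDA accepts both indices and UUIDs.
    """
    raw = env.get(variable)
    if raw is None:
        # Unset: for CUDA this means every device is visible to the process.
        return None
    # An empty value yields an empty list, i.e. no devices are exposed.
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    for name in ("CUDA_VISIBLE_DEVICES", "ASCEND_RT_VISIBLE_DEVICES"):
        print(name, "=", visible_devices(os.environ, name))
```

Setting the variable in the shell before starting Python (for example CUDA_VISIBLE_DEVICES=0,1) is the usual pattern, because the runtime typically reads it when the device library initializes.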

Quick Install

pip install torch

Code Evidence

From verl/utils/device.py:

is_cuda_available = torch.cuda.is_available()
is_npu_available = is_torch_npu_available()

And the device name function:

def get_device_name() -> str:
    if is_cuda_available:
        device = "cuda"
    elif is_npu_available:
        device = "npu"
    else:
        device = "cpu"
    return device
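
The excerpt relies on is_torch_npu_available(), whose body is not shown here. A self-contained approximation of the same priority order (CUDA first, then NPU, then CPU) might look like the sketch below. The function names are hypothetical, and the NPU probe assumes that importing torch_npu registers a torch.npu namespace. It is not verl's actual implementation.

```python
try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def cuda_available() -> bool:
    """True when PyTorch is installed and reports a usable CUDA device."""
    return torch is not None and torch.cuda.is_available()


def npu_available() -> bool:
    """True when torch_npu is importable and reports a usable Ascend NPU."""
    if torch is None:
        return False
    try:
        # Assumption: importing torch_npu registers the `torch.npu` namespace.
        import torch_npu  # noqa: F401
    except Exception:
        return False
    npu = getattr(torch, "npu", None)
    return npu is not None and bool(npu.is_available())


def pick_device_name(cuda_ok: bool, npu_ok: bool) -> str:
    """Mirror the excerpt's priority order: CUDA, then NPU, then CPU."""
    if cuda_ok:
        return "cuda"
    if npu_ok:
        return "npu"
    return "cpu"


if __name__ == "__main__":
    print(pick_device_name(cuda_available(), npu_available()))
```

Keeping the priority logic in a pure function makes it easy to test without any accelerator present.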

And the IPC version check from verl/utils/device.py:187-296:

if version.parse(software_base) >= version.parse("25.3.rc1"):
    if version.parse(cann_base) >= version.parse("8.3.rc1"):
        return True

Common Errors

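Error | Solution
"CUDA not available" | Install the NVIDIA GPU driver and CUDA toolkit, and make sure the installed torch build has CUDA support
"IPC not supported on your devices" | Update the Ascend software to >= 25.3.rc1 and CANN to >= 8.3.rc1
"No GPU/NPU detected" | Check that the device drivers are installed and the devices are visible (for example with nvidia-smi or npu-smi info)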

Compatibility Notes

NVIDIA GPUs with compute capability >= 7.0 (Volta or newer) are recommended. Ascend NPUs require torch_npu. A CPU mode is available but impractical for LLM training.
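
The compute capability recommendation can be checked with torch.cuda.get_device_capability(), which returns a (major, minor) tuple. The sketch below is a minimal illustration: the helper names and the RECOMMENDED_CAPABILITY constant are hypothetical and are not taken from verl.

```python
from typing import Optional, Tuple

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None

# Threshold from the note above: compute capability 7.0.
RECOMMENDED_CAPABILITY = (7, 0)


def current_capability(index: int = 0) -> Optional[Tuple[int, int]]:
    """Return (major, minor) for CUDA device `index`, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(index)
    return (major, minor)


def meets_recommendation(
    capability: Optional[Tuple[int, int]],
    minimum: Tuple[int, int] = RECOMMENDED_CAPABILITY,
) -> bool:
    """Compare capabilities as tuples, so (8, 0) >= (7, 0) and (6, 1) < (7, 0)."""
    return capability is not None and tuple(capability) >= tuple(minimum)


if __name__ == "__main__":
    cap = current_capability()
    print("capability:", cap, "recommended:", meets_recommendation(cap))
```

For reference, an A100 reports (8, 0) and an H100 reports (9, 0), so both preferred GPUs clear the threshold.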
