Environment:Volcengine Verl Ray Distributed Environment

From Leeroopedia


sources: Repo|verl|https://github.com/volcengine/verl

domains: Infrastructure, Distributed_Training

last_updated: 2026-02-07 17:00 GMT

Overview

Ray distributed computing environment required for verl's single-controller orchestration of multi-GPU RL training.

Description

verl uses Ray >= 2.41.0 as its distributed computing framework. Ray manages GPU worker groups, transfers data between actors, and allocates cluster resources. Under the single-controller architecture, the controller dispatches work to Ray worker groups for the actor, critic, reward model, and rollout roles. verl sets NCCL-related environment variables automatically, including NCCL_DEBUG ("WARN") and NCCL_CUMEM_ENABLE ("0").
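The dispatch pattern described above can be sketched in a few lines. This is a minimal illustration, not verl's implementation: `split_batch`, `EchoWorker`, and `run_on_ray` are hypothetical names, and the even, contiguous split is an assumption. Ray is imported lazily so the splitting logic can be tried without a cluster.

```python
from typing import List, Sequence


def split_batch(batch: Sequence, num_workers: int) -> List[list]:
    """Split a batch into num_workers contiguous, near-equal chunks."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    base, extra = divmod(len(batch), num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more item
        chunks.append(list(batch[start:start + size]))
        start += size
    return chunks


class EchoWorker:
    """Hypothetical stand-in for an actor, critic, or rollout worker."""

    def __init__(self, rank: int):
        self.rank = rank

    def process(self, chunk: list) -> list:
        # Tag every item with the rank that handled it.
        return [(self.rank, item) for item in chunk]


def run_on_ray(batch: Sequence, num_workers: int = 2, gpus_per_worker: float = 0) -> list:
    """Controller side: create Ray actors, send one chunk to each, gather results."""
    import ray  # lazy import: only needed when actually dispatching

    if not ray.is_initialized():
        ray.init()
    remote_cls = ray.remote(EchoWorker).options(num_gpus=gpus_per_worker)
    workers = [remote_cls.remote(rank) for rank in range(num_workers)]
    chunks = split_batch(batch, num_workers)
    futures = [w.process.remote(c) for w, c in zip(workers, chunks)]
    return [item for part in ray.get(futures) for item in part]
```

Calling `run_on_ray(list(range(5)))` on a machine with Ray installed would start two CPU-only actors; setting `gpus_per_worker=1` asks Ray to reserve one GPU per actor instead.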

Usage

Required for every distributed training workflow in verl, including single-node multi-GPU training.

System Requirements

  • Linux
  • Multi-GPU node or cluster with network connectivity

Dependencies

  • ray[default] >= 2.41.0

Environment Variables

  • RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES (optional; when set, Ray does not set CUDA_VISIBLE_DEVICES for its workers)
  • RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES (optional; the equivalent setting for Ascend NPU)
  • NCCL_DEBUG (auto-set to "WARN")
  • NCCL_CUMEM_ENABLE (auto-set to "0")
  • DIST_INIT_METHOD (initialization method for the distributed process group)
  • RANK, WORLD_SIZE, LOCAL_RANK (set automatically for each Ray worker)
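As an illustration of how a worker process might consume the rank variables listed above, the sketch below parses them into a small record. `DistEnv` and `read_dist_env` are hypothetical helpers, not part of verl or Ray, and the sanity check on the rank range is an assumption about what a caller would want.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int
    world_size: int
    local_rank: int


def read_dist_env(environ: Mapping[str, str] = os.environ) -> DistEnv:
    """Parse RANK, WORLD_SIZE and LOCAL_RANK; fail loudly if any is absent."""
    names = ("RANK", "WORLD_SIZE", "LOCAL_RANK")
    missing = [name for name in names if name not in environ]
    if missing:
        raise RuntimeError(f"missing distributed variables: {missing}")
    env = DistEnv(*(int(environ[name]) for name in names))
    if not 0 <= env.rank < env.world_size:
        raise ValueError(f"RANK {env.rank} is outside WORLD_SIZE {env.world_size}")
    return env
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise outside a launched worker.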

Quick Install

pip install "ray[default]>=2.41.0"
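After installing, the version requirement can be checked from Python. The sketch below uses only the standard library; `parse_release` and `ray_is_new_enough` are hypothetical helper names, and the parser deliberately ignores pre-release suffixes, so a release candidate such as 2.41.0rc1 would be treated like 2.41.0.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_RAY = (2, 41, 0)  # minimum version stated on this page


def parse_release(text: str) -> tuple:
    """Turn '2.41.0' or '2.43.0rc1' into a tuple of three integers."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1'
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)  # pad short versions like '2.9'
    return tuple(parts)


def ray_is_new_enough(installed: str, minimum: tuple = MIN_RAY) -> bool:
    return parse_release(installed) >= minimum


if __name__ == "__main__":
    try:
        found = version("ray")
    except PackageNotFoundError:
        print("ray is not installed")
    else:
        status = "ok" if ray_is_new_enough(found) else "too old"
        print(f"ray {found}: {status}")
```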

Code Evidence

From verl/trainer/constants_ppo.py:23,31:

NCCL_DEBUG: "WARN"
NCCL_CUMEM_ENABLE: "0"

And from verl/utils/ray_utils.py:39,42:

RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES
RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES
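To show how defaults like the NCCL values quoted above can reach Ray workers, here is a minimal sketch that builds a `runtime_env` dictionary and passes it to `ray.init`. The helper names are hypothetical, and the rule that caller-supplied overrides win over the defaults is an assumption; verl's own merging logic may differ.

```python
from typing import Dict, Optional

# Values quoted from verl/trainer/constants_ppo.py in this section.
NCCL_DEFAULTS = {"NCCL_DEBUG": "WARN", "NCCL_CUMEM_ENABLE": "0"}


def build_runtime_env(overrides: Optional[Dict[str, str]] = None) -> dict:
    """Return a Ray runtime_env whose env_vars start from the NCCL defaults."""
    env_vars = dict(NCCL_DEFAULTS)
    env_vars.update(overrides or {})  # assumption: explicit overrides take priority
    return {"env_vars": env_vars}


def init_ray_with_defaults(overrides: Optional[Dict[str, str]] = None) -> None:
    """Start Ray (if not already running) so that workers inherit the env_vars."""
    import ray  # lazy import: build_runtime_env works without Ray installed

    if not ray.is_initialized():
        ray.init(runtime_env=build_runtime_env(overrides))
```

For example, `build_runtime_env({"NCCL_DEBUG": "INFO"})` keeps NCCL_CUMEM_ENABLE at "0" while raising NCCL log verbosity for a debugging run.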

Common Errors

  • "Ray not initialized" -> Call ray.init() before creating workers, or check connectivity to the cluster
  • "Insufficient resources" -> Check GPU availability with the ray status command
  • "NCCL timeout" -> Check network connectivity between nodes and the NCCL configuration
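The first two checks can also be run from Python before launching training. This is a rough sketch under stated assumptions: `gpu_shortfall` and `diagnose` are hypothetical helpers, and the returned messages only mirror the error names above rather than reproduce real Ray or verl output.

```python
from typing import Mapping


def gpu_shortfall(cluster_resources: Mapping[str, float], gpus_needed: int) -> int:
    """Number of GPUs still missing, given a ray.cluster_resources()-style mapping."""
    available = int(cluster_resources.get("GPU", 0))
    return max(0, gpus_needed - available)


def diagnose(gpus_needed: int) -> str:
    """Report whether Ray is up and whether the cluster has enough GPUs."""
    import ray  # lazy import: gpu_shortfall works without Ray installed

    if not ray.is_initialized():
        return "Ray not initialized: call ray.init() first"
    missing = gpu_shortfall(ray.cluster_resources(), gpus_needed)
    if missing:
        return f"Insufficient resources: {missing} more GPU(s) needed"
    return "ok"
```

`ray.cluster_resources()` reports the total resources registered with the cluster, so this check catches a cluster that is too small, not one whose GPUs are busy with other jobs.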

Compatibility Notes

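Ray controls which devices each worker can see through environment variables such as CUDA_VISIBLE_DEVICES; the RAY_EXPERIMENTAL_NOSET_* variables listed above turn this behavior off. verl configures the NCCL environment automatically. On Ascend NPU, HCCL is used instead of NCCL.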
