Environment: Volcengine Verl Ray Distributed Environment
sources: Repo|verl|https://github.com/volcengine/verl
domains: Infrastructure, Distributed_Training
last_updated: 2026-02-07 17:00 GMT
Overview
Ray distributed computing environment required for verl's single-controller orchestration of multi-GPU RL training.
Description
verl uses Ray >= 2.41.0 as its distributed computing framework. Ray manages GPU worker groups, data transfer between actors, and cluster resource allocation. The single-controller architecture dispatches work to Ray worker groups for actor, critic, reward model, and rollout workers. Environment variables like NCCL_DEBUG and NCCL_CUMEM_ENABLE are auto-configured.
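As a hedged sketch of the auto-configured NCCL environment described above: the values below mirror the constants quoted in Code Evidence (verl/trainer/constants_ppo.py), but the dict and helper names here are illustrative, not verl's actual API.

```python
from typing import Optional

# Worker environment defaults, mirroring verl/trainer/constants_ppo.py.
# The names DEFAULT_RUNTIME_ENV and build_runtime_env are ours.
DEFAULT_RUNTIME_ENV = {
    "env_vars": {
        "NCCL_DEBUG": "WARN",      # surface NCCL warnings without full debug noise
        "NCCL_CUMEM_ENABLE": "0",  # disable the cuMem allocator for compatibility
    }
}

def build_runtime_env(extra_env: Optional[dict] = None) -> dict:
    """Merge user overrides on top of the default worker env vars."""
    env = dict(DEFAULT_RUNTIME_ENV["env_vars"])
    if extra_env:
        env.update(extra_env)
    return {"env_vars": env}

# A driver script could then pass this to Ray at startup, e.g.:
#   import ray
#   ray.init(runtime_env=build_runtime_env({"NCCL_DEBUG": "INFO"}))
```

Passing `runtime_env` at `ray.init` time propagates the variables to every Ray worker process, which is how a single-controller driver can configure NCCL cluster-wide.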
Usage
Required for all distributed training workflows in verl. Needed even for single-node multi-GPU training.
System Requirements
- Linux
- Multi-GPU node or cluster with network connectivity
Dependencies
ray[default]>=2.41.0
Environment Variables
- RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES (optional, controls whether Ray sets GPU visibility for workers)
- RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES (optional, equivalent for Ascend NPU)
- NCCL_DEBUG (auto-set to "WARN")
- NCCL_CUMEM_ENABLE (auto-set to "0")
- DIST_INIT_METHOD (distributed init method)
- RANK, WORLD_SIZE, LOCAL_RANK (auto-set by Ray)
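A minimal sketch of how the RAY_EXPERIMENTAL_NOSET_* flags above might be consulted. The helper name and the set of accepted "off" values are assumptions for illustration; verl keeps its own check in verl/utils/ray_utils.py.

```python
import os

def ray_noset_visible_devices(device: str = "cuda") -> bool:
    """Return True if Ray is told NOT to manage device visibility,
    i.e. the training framework assigns GPUs/NPUs itself.

    Helper name is ours; the env var names are from Ray.
    """
    flag = {
        "cuda": "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES",
        "npu": "RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES",
    }[device]
    # Unset or explicitly falsy means Ray keeps control (assumed convention).
    return os.environ.get(flag, "0") not in ("0", "", "false", "False")
```

When the flag is unset, Ray restricts each worker to its assigned devices via CUDA_VISIBLE_DEVICES; setting it hands that responsibility to the framework.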
Quick Install
pip install "ray[default]>=2.41.0"
Code Evidence
From verl/trainer/constants_ppo.py:23,31:
NCCL_DEBUG: "WARN"
NCCL_CUMEM_ENABLE: "0"
And from verl/utils/ray_utils.py:39,42:
RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES
RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES
Common Errors
- "Ray not initialized" -> Call ray.init() or check cluster connectivity
- "Insufficient resources" -> Check GPU availability with ray status
- "NCCL timeout" -> Check network connectivity and NCCL configuration
Compatibility Notes
Ray manages device visibility via environment variables. verl configures NCCL environment automatically. For Ascend NPU, HCCL is used instead of NCCL.