Environment: AllenAI open-instruct Ray Distributed
| Knowledge Sources | Value |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
The Ray distributed computing framework is required for GRPO multi-actor training with vLLM inference.
Description
GRPO training uses Ray to coordinate multiple actors: policy trainers (DeepSpeed), inference engines (vLLM via LLMRayActor), and data preparation actors. Ray handles GPU allocation, process group management, and collective communication for weight synchronization between training and inference. The cluster is initialized via `ray_node_setup.sh` on Beaker nodes.
Usage
Use this environment for GRPO reinforcement learning only; SFT, DPO, and reward-model training use Accelerate/DeepSpeed directly, without Ray. Ray is required because GRPO runs training and inference simultaneously on separate GPU sets.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Ray with GPU support requires Linux |
| Hardware | Multiple GPUs | Separate GPU sets for training and inference |
| Network | High-speed interconnect | For Ray object store and collective communication |
Dependencies
Python Packages
- `ray[default]` >= 2.49.2
Environment Variables
- `RAY_CGRAPH_get_timeout` = 300 (timeout in seconds for Ray Compiled Graph `get()` calls)
- `NCCL_CUMEM_ENABLE` = 0 (set in ray_node_setup.sh)
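A minimal sketch of applying these variables from Python (the values mirror `mason.py` and `ray_node_setup.sh`):

```python
import os

# Both must be exported before ray.init(); changing them afterwards does not
# affect workers that Ray has already spawned.
os.environ["RAY_CGRAPH_get_timeout"] = "300"  # compiled-graph get() timeout, seconds
os.environ["NCCL_CUMEM_ENABLE"] = "0"         # matches ray_node_setup.sh
```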
Credentials
Ray itself requires no credentials; any credentials needed by the workload are inherited from the Beaker environment.
Quick Install
# Ray is installed as part of the main project
uv sync
# Manual install
pip install "ray[default]>=2.49.2"
Code Evidence
Ray node setup from `configs/beaker_configs/ray_node_setup.sh:5-7`:
export NCCL_CUMEM_ENABLE=0
Ray collective import from `vllm_utils_workerwrap.py:29`:
import ray.util.collective as collective
Ray timeout configuration from `mason.py:98`:
"RAY_CGRAPH_get_timeout": "300",
Actor manager coordination from `actor_manager.py:43-50`:
class ActorManager:
    """Manages the lifecycle of Ray actors for distributed GRPO training."""
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| Ray cluster timeout | Node initialization too slow | Check network connectivity; increase RAY_CGRAPH_get_timeout |
| GPU not visible to Ray actor | Incorrect CUDA_VISIBLE_DEVICES | Ensure GPU allocation matches Ray resource specification |
| Weight sync hangs | NCCL process group conflict | Pre-initialize torch.distributed before DeepSpeed (see Pre_Init_Torch_Distributed heuristic) |
Compatibility Notes
- Single-node: Ray can run on a single node with multiple GPUs; the head node and workers share the same machine.
- Multi-node: Requires Beaker multi-node experiment setup with `ray_node_setup.sh` on each node.
- GPU allocation: Training and inference GPUs must be explicitly partitioned. Overlapping allocations cause OOM.
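The explicit partitioning can be sketched as a small helper (a hypothetical function, not part of the repo), assuming training GPUs take the low device ids and inference GPUs the rest:

```python
def partition_gpus(total_gpus, num_inference_gpus):
    """Assign the first GPU ids to training and the remainder to vLLM inference.

    Returns two disjoint id lists, so the CUDA_VISIBLE_DEVICES sets never overlap.
    """
    if not 0 < num_inference_gpus < total_gpus:
        raise ValueError("need at least one GPU on each side")
    split = total_gpus - num_inference_gpus
    return list(range(split)), list(range(split, total_gpus))

# e.g. an 8-GPU node reserving 2 GPUs for inference:
train_ids, infer_ids = partition_gpus(8, 2)
```

Keeping the split disjoint is what prevents the overlapping-allocation OOM noted above.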