Environment: AllenAI open-instruct Ray Distributed
| Knowledge Sources | Value |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
The Ray distributed computing framework is required for GRPO multi-actor training with vLLM inference.
Description
GRPO training uses Ray to coordinate multiple actors: policy trainers (DeepSpeed), inference engines (vLLM via LLMRayActor), and data preparation actors. Ray handles GPU allocation, process group management, and collective communication for weight synchronization between training and inference. The cluster is initialized via `ray_node_setup.sh` on Beaker nodes.
Usage
Use this environment for GRPO reinforcement learning only; SFT, DPO, and reward-model training use Accelerate/DeepSpeed directly, without Ray. Ray is required because GRPO runs training and inference simultaneously on separate GPU sets.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Ray with GPU support requires Linux |
| Hardware | Multiple GPUs | Separate GPU sets for training and inference |
| Network | High-speed interconnect | For Ray object store and collective communication |
Dependencies
Python Packages
- `ray[default]` >= 2.49.2
Environment Variables
- `RAY_CGRAPH_get_timeout` = 300 (timeout in seconds for Ray Compiled Graph `get()` calls)
- `NCCL_CUMEM_ENABLE` = 0 (set in ray_node_setup.sh)
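A minimal sketch of applying these variables from Python (the values mirror `mason.py` and `ray_node_setup.sh`):

```python
import os

# Both must be exported before ray.init(); changing them afterwards does not
# affect workers that Ray has already spawned.
os.environ["RAY_CGRAPH_get_timeout"] = "300"  # compiled-graph get() timeout, seconds
os.environ["NCCL_CUMEM_ENABLE"] = "0"         # matches ray_node_setup.sh
```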
Credentials
Ray itself requires no credentials; any credentials needed by the workload are inherited from the Beaker environment.
Quick Install
# Ray is installed as part of the main project
uv sync
# Manual install
pip install "ray[default]>=2.49.2"
Code Evidence
Ray node setup from `configs/beaker_configs/ray_node_setup.sh:5-7`:
export NCCL_CUMEM_ENABLE=0
Ray collective import from `vllm_utils_workerwrap.py:29`:
import ray.util.collective as collective
Ray timeout configuration from `mason.py:98`:
"RAY_CGRAPH_get_timeout": "300",
Actor manager coordination from `actor_manager.py:43-50`:
class ActorManager:
    """Manages the lifecycle of Ray actors for distributed GRPO training."""
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| Ray cluster timeout | Node initialization too slow | Check network connectivity; increase RAY_CGRAPH_get_timeout |
| GPU not visible to Ray actor | Incorrect CUDA_VISIBLE_DEVICES | Ensure GPU allocation matches Ray resource specification |
| Weight sync hangs | NCCL process group conflict | Pre-initialize torch.distributed before DeepSpeed (see Pre_Init_Torch_Distributed heuristic) |
Compatibility Notes
- Single-node: Ray can run on a single node with multiple GPUs; the head node and workers share the same machine.
- Multi-node: Requires Beaker multi-node experiment setup with `ray_node_setup.sh` on each node.
- GPU allocation: Training and inference GPUs must be explicitly partitioned. Overlapping allocations cause OOM.
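The explicit partitioning can be sketched as a small helper (a hypothetical function, not part of the repo), assuming training GPUs take the low device ids and inference GPUs the rest:

```python
def partition_gpus(total_gpus, num_inference_gpus):
    """Assign the first GPU ids to training and the remainder to vLLM inference.

    Returns two disjoint id lists, so the CUDA_VISIBLE_DEVICES sets never overlap.
    """
    if not 0 < num_inference_gpus < total_gpus:
        raise ValueError("need at least one GPU on each side")
    split = total_gpus - num_inference_gpus
    return list(range(split)), list(range(split, total_gpus))

# e.g. an 8-GPU node reserving 2 GPUs for inference:
train_ids, infer_ids = partition_gpus(8, 2)
```

Keeping the split disjoint is what prevents the overlapping-allocation OOM noted above.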