
Environment:Allenai Open instruct Ray Distributed

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Distributed_Training
Last Updated: 2026-02-07 00:00 GMT

Overview

The Ray distributed computing framework is required for GRPO multi-actor training with vLLM inference.

Description

GRPO training uses Ray to coordinate multiple actors: policy trainers (DeepSpeed), inference engines (vLLM via LLMRayActor), and data preparation actors. Ray handles GPU allocation, process group management, and collective communication for weight synchronization between training and inference. The cluster is initialized via `ray_node_setup.sh` on Beaker nodes.
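As a rough sketch of that topology (role names paraphrase the paragraph above; the GPU counts are illustrative placeholders, not project defaults):

```python
from dataclasses import dataclass

# Sketch of the GRPO actor topology described above. Role names paraphrase
# the text; GPU counts are illustrative, not open-instruct defaults.
@dataclass
class ActorRole:
    name: str
    backend: str
    gpus: int

TOPOLOGY = [
    ActorRole("policy_trainer", "DeepSpeed", gpus=6),
    ActorRole("inference_engine", "vLLM (LLMRayActor)", gpus=2),
    ActorRole("data_prep", "Ray actor", gpus=0),
]

# Ray schedules each role onto its own GPU set; weight synchronization
# between trainer and inference engine runs over a collective process group.
total_gpus = sum(r.gpus for r in TOPOLOGY)
print(total_gpus)  # → 8
```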

Usage

Use this environment for GRPO reinforcement learning only. SFT, DPO, and Reward Model training use Accelerate/DeepSpeed directly without Ray. Ray is needed when training and inference must run on separate GPU sets simultaneously.
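The launcher-per-mode rule above can be summarized in a small table (a sketch restating the text; the mapping keys are illustrative names, not CLI flags):

```python
# Which launcher each training mode uses in this setup, per the text above.
# Keys are illustrative mode names, not actual open-instruct CLI arguments.
LAUNCHER = {
    "grpo": "ray",            # trainer + vLLM inference on separate GPU sets
    "sft": "accelerate",      # Accelerate/DeepSpeed directly, no Ray
    "dpo": "accelerate",
    "reward_model": "accelerate",
}

def needs_ray(mode: str) -> bool:
    """True only for modes where training and inference run concurrently."""
    return LAUNCHER[mode.lower()] == "ray"

print(needs_ray("GRPO"))  # → True
```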

System Requirements

| Category | Requirement | Notes |
| --- | --- | --- |
| OS | Linux | Ray with GPU support requires Linux |
| Hardware | Multiple GPUs | Separate GPU sets for training and inference |
| Network | High-speed interconnect | For Ray object store and collective communication |

Dependencies

Python Packages

  • `ray[default]` >= 2.49.2

Environment Variables

  • `RAY_CGRAPH_get_timeout` = 300 (timeout, in seconds, for Ray Compiled Graph operations)
  • `NCCL_CUMEM_ENABLE` = 0 (set in ray_node_setup.sh)
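A minimal sketch of exporting both settings before launching Ray (values mirror the defaults quoted above; the exact export location varies, and `ray_node_setup.sh` already sets `NCCL_CUMEM_ENABLE`):

```shell
# Export the two settings described above before starting Ray.
export RAY_CGRAPH_get_timeout=300   # Ray Compiled Graph op timeout (seconds)
export NCCL_CUMEM_ENABLE=0          # disable NCCL cuMem allocations
```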

Credentials

No credentials are required for Ray itself; any credentials needed by other components are inherited from the Beaker environment.

Quick Install

# Ray is installed as part of the main project
uv sync

# Manual install
pip install "ray[default]>=2.49.2"

Code Evidence

Ray node setup from `configs/beaker_configs/ray_node_setup.sh:5-7`:

export NCCL_CUMEM_ENABLE=0

Ray collective import from `vllm_utils_workerwrap.py:29`:

import ray.util.collective as collective

Ray timeout configuration from `mason.py:98`:

"RAY_CGRAPH_get_timeout": "300",

Actor manager coordination from `actor_manager.py:43-50`:

class ActorManager:
    """Manages the lifecycle of Ray actors for distributed GRPO training."""

Common Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| Ray cluster timeout | Node initialization too slow | Check network connectivity; increase `RAY_CGRAPH_get_timeout` |
| GPU not visible to Ray actor | Incorrect `CUDA_VISIBLE_DEVICES` | Ensure GPU allocation matches the Ray resource specification |
| Weight sync hangs | NCCL process group conflict | Pre-initialize `torch.distributed` before DeepSpeed (see Pre_Init_Torch_Distributed heuristic) |

Compatibility Notes

  • Single-node: Ray can run on a single node with multiple GPUs. The head node and workers run on the same machine.
  • Multi-node: Requires Beaker multi-node experiment setup with `ray_node_setup.sh` on each node.
  • GPU allocation: Training and inference GPUs must be explicitly partitioned. Overlapping allocations cause OOM.
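The explicit-partitioning requirement can be sketched as a small helper that splits GPU indices and guarantees the two sets never overlap (function name and split convention are illustrative, not from the codebase):

```python
# Illustrative helper: reserve the last `num_inference` GPUs for vLLM and
# the rest for training, guaranteeing disjoint sets (overlap causes OOM).
def partition_gpus(total, num_inference):
    if not 0 < num_inference < total:
        raise ValueError("need at least one GPU on each side")
    gpus = list(range(total))
    return gpus[: total - num_inference], gpus[total - num_inference :]

train, infer = partition_gpus(8, 2)
print(train, infer)  # → [0, 1, 2, 3, 4, 5] [6, 7]
assert not set(train) & set(infer)
```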
