Principle: Allenai Open-Instruct Distributed Policy Training
| Knowledge Sources | |
|---|---|
| Domains | Distributed Computing, Reinforcement Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Distributed policy training is the technique of using Ray actors and DeepSpeed to train an RL policy model across multiple GPUs and nodes, combining actor-based coordination with ZeRO-based memory optimization.
Description
GRPO training requires running forward and backward passes on a large language model while also managing reference model inference, weight synchronization, checkpointing, and learning rate scheduling. This is accomplished by deploying multiple PolicyTrainerRayProcess actors, each owning one GPU and managed by Ray, while using DeepSpeed for the underlying distributed training mechanics.
The architecture uses two distinct layers of parallelism:
- Ray actor layer: Each learner GPU runs as an independent Ray actor. The main process coordinates these actors by issuing remote calls (e.g., `step()`, `save_model()`, `broadcast_to_vllm()`). This layer handles lifecycle management, fault detection, and heterogeneous scheduling.
- DeepSpeed layer: Within the learner actors, DeepSpeed provides data-parallel training with ZeRO optimization. DeepSpeed handles gradient synchronization, optimizer state sharding, mixed-precision training, and gradient checkpointing. The ZeRO stage (0, 2, or 3) determines the memory-efficiency/communication tradeoff.
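The division of labor between the two layers can be illustrated with a minimal pure-Python sketch. The `PolicyTrainerProcess` class and `coordinate` helper below are illustrative stand-ins, not open-instruct code: the coordinator launches every call without blocking, then waits on all results, mirroring the Ray pattern `ray.get([actor.step.remote() for actor in actors])`.

```python
from concurrent.futures import ThreadPoolExecutor

class PolicyTrainerProcess:
    """Illustrative stand-in for a Ray actor that owns one learner GPU."""
    def __init__(self, rank: int):
        self.rank = rank
        self.steps_done = 0

    def step(self) -> int:
        # In the real system this runs one DeepSpeed training step on its GPU.
        self.steps_done += 1
        return self.steps_done

def coordinate(actors, executor):
    """Mimic ray.get([a.step.remote() for a in actors]):
    fire every call without blocking, then wait for all results."""
    futures = [executor.submit(a.step) for a in actors]  # ~ .remote()
    return [f.result() for f in futures]                 # ~ ray.get()

actors = [PolicyTrainerProcess(rank) for rank in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = coordinate(actors, pool)
print(results)  # [1, 1, 1, 1]
```

Within each real actor, DeepSpeed then handles the step internals; the coordinator never touches gradients or optimizer state directly.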
Key design decisions:
- One GPU per actor: Each PolicyTrainerRayProcess manages exactly one GPU. This simplifies resource management and allows fine-grained placement.
- Variable learners per node: Different nodes can have different numbers of learner GPUs (e.g., `--num_learners_per_node 2 4` means 2 learners on node 0 and 4 on node 1), enabling flexible hardware utilization.
- Gradient checkpointing: Enabled by default to reduce memory usage at the cost of recomputing activations during the backward pass.
- Sequence parallelism: For very long sequences, multiple GPUs can be used to parallelize the sequence dimension using Ulysses-style attention splitting.
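The rank layout implied by per-node learner counts can be checked with a small helper (`learner_ranks` is a hypothetical name for illustration, not an open-instruct function):

```python
def learner_ranks(num_learners_per_node):
    """Map per-node learner counts to global ranks and the world size.

    E.g. --num_learners_per_node 2 4 -> node 0 owns ranks [0, 1],
    node 1 owns ranks [2, 3, 4, 5], world size 6."""
    ranks_per_node, next_rank = [], 0
    for count in num_learners_per_node:
        ranks_per_node.append(list(range(next_rank, next_rank + count)))
        next_rank += count
    return ranks_per_node, next_rank

per_node, world_size = learner_ranks([2, 4])
print(per_node, world_size)  # [[0, 1], [2, 3, 4, 5]] 6
```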
Usage
Distributed policy training is the standard approach for all GRPO training runs with more than one learner GPU. It is configured via the ExperimentConfig parameters `num_learners_per_node`, `deepspeed_stage`, `sequence_parallel_size`, and `gradient_checkpointing`.
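A minimal sketch of the relevant slice of the configuration, using the field names from this document; the dataclass shape and defaults below are illustrative assumptions, not the actual ExperimentConfig definition:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentConfig:
    # Sketch of the distributed-training slice of the config (illustrative).
    num_learners_per_node: list = field(default_factory=lambda: [8])
    deepspeed_stage: int = 2          # ZeRO stage: 0, 2, or 3
    sequence_parallel_size: int = 1   # >1 enables Ulysses-style splitting
    gradient_checkpointing: bool = True

# Two nodes: 2 learners on node 0, 4 on node 1, ZeRO-3.
cfg = ExperimentConfig(num_learners_per_node=[2, 4], deepspeed_stage=3)
```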
Theoretical Basis
DeepSpeed ZeRO Stages
Stage 0: No sharding. Each GPU holds the full model, optimizer states, and gradients.
- Memory per GPU: M + O + G (where M = model parameters, O = optimizer states, G = gradients, N = data-parallel world size)
- Communication: standard AllReduce for gradients

Stage 2: Optimizer-state and gradient sharding.
- Memory per GPU: M + (O + G) / N
- Communication: ReduceScatter for gradients, AllGather for updated parameters

Stage 3: Full sharding (parameters + optimizer states + gradients).
- Memory per GPU: (M + O + G) / N
- Communication: AllGather of parameters during forward/backward, ReduceScatter for gradients
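These formulas can be evaluated directly. The sketch below plugs in rough numbers for a 7B-parameter model (bf16 weights ≈ 14 GB, fp32 Adam states ≈ 84 GB at ~12 bytes/param); the figures are illustrative assumptions, not from the source:

```python
def zero_memory_per_gpu(stage: int, M: float, O: float, G: float, N: int) -> float:
    """Per-GPU memory for model M, optimizer states O, gradients G
    across N data-parallel ranks, per the ZeRO formulas above."""
    if stage == 0:
        return M + O + G            # nothing sharded
    if stage == 2:
        return M + (O + G) / N      # optimizer states + gradients sharded
    if stage == 3:
        return (M + O + G) / N      # everything sharded
    raise ValueError(f"unsupported ZeRO stage: {stage}")

# Rough 7B example in GB, 8-way data parallel:
M, O, G, N = 14, 84, 14, 8
for stage in (0, 2, 3):
    print(stage, zero_memory_per_gpu(stage, M, O, G, N))
# 0 -> 112.0, 2 -> 26.25, 3 -> 14.0
```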
ZeRO Partition Groups (ZPG)
The deepspeed_zpg parameter controls the ZeRO partition group size. Setting zpg=8 (default, typically one node) means that parameter sharding only occurs within groups of 8 GPUs, reducing cross-node communication:
With zpg=8 on 2 nodes of 8 GPUs each:
- GPUs 0-7 form one partition group
- GPUs 8-15 form another partition group
- AllGather/ReduceScatter only within groups (fast intra-node)
- AllReduce across groups for gradient synchronization
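The group layout above follows from a simple partition of the ranks; a minimal sketch (`partition_groups` is a hypothetical helper, not a DeepSpeed API):

```python
def partition_groups(world_size: int, zpg: int):
    """Split ranks into ZeRO partition groups of size zpg.
    Parameter AllGather/ReduceScatter stays within each group."""
    assert world_size % zpg == 0, "world size must be divisible by zpg"
    return [list(range(start, start + zpg))
            for start in range(0, world_size, zpg)]

# 2 nodes x 8 GPUs with zpg=8: one group per node.
groups = partition_groups(16, 8)
print(groups)  # [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]]
```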
Ray + DeepSpeed Integration
The integration requires careful initialization ordering:
1. Ray spawns N actor processes (one per learner GPU)
2. Each actor initializes torch.distributed (NCCL backend)
3. Each actor calls deepspeed.init_distributed()
4. Each actor creates DeepSpeed engine with the policy model
5. Rank 0 actor creates an additional process group for vLLM weight sync
6. All actors synchronize via barrier
The pre-initialization of torch.distributed before DeepSpeed is critical to avoid NCCL hangs when multiple process groups exist.
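The ordering constraint can be made concrete with stub functions that record when each step runs. All names here are illustrative stand-ins for the real torch.distributed/DeepSpeed calls, which require GPUs and a launcher to actually execute:

```python
call_log = []

def init_torch_distributed(rank):      # stand-in for torch.distributed.init_process_group("nccl")
    call_log.append(("torch_dist", rank))

def init_deepspeed_distributed(rank):  # stand-in for deepspeed.init_distributed()
    # Must run AFTER torch.distributed is up, or NCCL can hang
    # once additional process groups (e.g. vLLM weight sync) exist.
    assert ("torch_dist", rank) in call_log, "torch.distributed not initialized"
    call_log.append(("deepspeed", rank))

def create_vllm_sync_group(rank):      # rank 0 only: extra group for weight broadcast
    call_log.append(("vllm_group", rank))

world_size = 2
for rank in range(world_size):         # each Ray actor runs this in its own process
    init_torch_distributed(rank)
    init_deepspeed_distributed(rank)
    if rank == 0:
        create_vllm_sync_group(rank)
# all actors would then synchronize on a barrier before training begins
print(call_log[0])  # ('torch_dist', 0)
```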