Principle: Allenai Open-Instruct Distributed Policy Training
| Knowledge Sources | |
|---|---|
| Domains | Distributed Computing, Reinforcement Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Distributed policy training is the technique of using Ray actors and DeepSpeed to train an RL policy model across multiple GPUs and nodes, combining actor-based coordination with ZeRO-based memory optimization.
Description
GRPO training requires running forward and backward passes on a large language model while also managing reference model inference, weight synchronization, checkpointing, and learning rate scheduling. This is accomplished by deploying multiple PolicyTrainerRayProcess actors, each owning one GPU and managed by Ray, while using DeepSpeed for the underlying distributed training mechanics.
The architecture uses two distinct layers of parallelism:
- Ray actor layer: Each learner GPU runs as an independent Ray actor. The main process coordinates these actors by issuing remote calls (e.g., `step()`, `save_model()`, `broadcast_to_vllm()`). This layer handles lifecycle management, fault detection, and heterogeneous scheduling.
- DeepSpeed layer: Within the learner actors, DeepSpeed provides data-parallel training with ZeRO optimization. DeepSpeed handles gradient synchronization, optimizer state sharding, mixed-precision training, and gradient checkpointing. The ZeRO stage (0, 2, or 3) determines the memory-efficiency/communication tradeoff.
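The division of labor between the two layers can be illustrated with a minimal pure-Python sketch. The `PolicyTrainerProcess` class and `coordinate` helper below are illustrative stand-ins, not open-instruct code: the coordinator launches every call without blocking, then waits on all results, mirroring the Ray pattern `ray.get([actor.step.remote() for actor in actors])`.

```python
from concurrent.futures import ThreadPoolExecutor

class PolicyTrainerProcess:
    """Illustrative stand-in for a Ray actor that owns one learner GPU."""
    def __init__(self, rank: int):
        self.rank = rank
        self.steps_done = 0

    def step(self) -> int:
        # In the real system this runs one DeepSpeed training step on its GPU.
        self.steps_done += 1
        return self.steps_done

def coordinate(actors, executor):
    """Mimic ray.get([a.step.remote() for a in actors]):
    fire every call without blocking, then wait for all results."""
    futures = [executor.submit(a.step) for a in actors]  # ~ .remote()
    return [f.result() for f in futures]                 # ~ ray.get()

actors = [PolicyTrainerProcess(rank) for rank in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = coordinate(actors, pool)
print(results)  # [1, 1, 1, 1]
```

Within each real actor, DeepSpeed then handles the step internals; the coordinator never touches gradients or optimizer state directly.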
Key design decisions:
- One GPU per actor: Each PolicyTrainerRayProcess manages exactly one GPU. This simplifies resource management and allows fine-grained placement.
- Variable learners per node: Different nodes can have different numbers of learner GPUs (e.g., `--num_learners_per_node 2 4` means 2 learners on node 0 and 4 on node 1), enabling flexible hardware utilization.
- Gradient checkpointing: Enabled by default to reduce memory usage at the cost of recomputing activations during the backward pass.
- Sequence parallelism: For very long sequences, multiple GPUs can be used to parallelize the sequence dimension using Ulysses-style attention splitting.
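The rank layout implied by per-node learner counts can be checked with a small helper (`learner_ranks` is a hypothetical name for illustration, not an open-instruct function):

```python
def learner_ranks(num_learners_per_node):
    """Map per-node learner counts to global ranks and the world size.

    E.g. --num_learners_per_node 2 4 -> node 0 owns ranks [0, 1],
    node 1 owns ranks [2, 3, 4, 5], world size 6."""
    ranks_per_node, next_rank = [], 0
    for count in num_learners_per_node:
        ranks_per_node.append(list(range(next_rank, next_rank + count)))
        next_rank += count
    return ranks_per_node, next_rank

per_node, world_size = learner_ranks([2, 4])
print(per_node, world_size)  # [[0, 1], [2, 3, 4, 5]] 6
```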
Usage
Distributed policy training is the standard approach for all GRPO training runs with more than one learner GPU. It is configured via the ExperimentConfig parameters `num_learners_per_node`, `deepspeed_stage`, `sequence_parallel_size`, and `gradient_checkpointing`.
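A minimal sketch of the relevant slice of the configuration, using the field names from this document; the dataclass shape and defaults below are illustrative assumptions, not the actual ExperimentConfig definition:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentConfig:
    # Sketch of the distributed-training slice of the config (illustrative).
    num_learners_per_node: list = field(default_factory=lambda: [8])
    deepspeed_stage: int = 2          # ZeRO stage: 0, 2, or 3
    sequence_parallel_size: int = 1   # >1 enables Ulysses-style splitting
    gradient_checkpointing: bool = True

# Two nodes: 2 learners on node 0, 4 on node 1, ZeRO-3.
cfg = ExperimentConfig(num_learners_per_node=[2, 4], deepspeed_stage=3)
```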
Theoretical Basis
DeepSpeed ZeRO Stages
Stage 0: No sharding. Each GPU holds the full model, optimizer states, and gradients.
- Memory per GPU: M + O + G (where M = model parameters, O = optimizer states, G = gradients, N = data-parallel world size)
- Communication: standard AllReduce for gradients

Stage 2: Optimizer-state and gradient sharding.
- Memory per GPU: M + (O + G) / N
- Communication: ReduceScatter for gradients, AllGather for updated parameters

Stage 3: Full sharding (parameters + optimizer states + gradients).
- Memory per GPU: (M + O + G) / N
- Communication: AllGather of parameters during forward/backward, ReduceScatter for gradients
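These formulas can be evaluated directly. The sketch below plugs in rough numbers for a 7B-parameter model (bf16 weights ≈ 14 GB, fp32 Adam states ≈ 84 GB at ~12 bytes/param); the figures are illustrative assumptions, not from the source:

```python
def zero_memory_per_gpu(stage: int, M: float, O: float, G: float, N: int) -> float:
    """Per-GPU memory for model M, optimizer states O, gradients G
    across N data-parallel ranks, per the ZeRO formulas above."""
    if stage == 0:
        return M + O + G            # nothing sharded
    if stage == 2:
        return M + (O + G) / N      # optimizer states + gradients sharded
    if stage == 3:
        return (M + O + G) / N      # everything sharded
    raise ValueError(f"unsupported ZeRO stage: {stage}")

# Rough 7B example in GB, 8-way data parallel:
M, O, G, N = 14, 84, 14, 8
for stage in (0, 2, 3):
    print(stage, zero_memory_per_gpu(stage, M, O, G, N))
# 0 -> 112.0, 2 -> 26.25, 3 -> 14.0
```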
ZeRO Partition Groups (ZPG)
The deepspeed_zpg parameter controls the ZeRO partition group size. Setting zpg=8 (default, typically one node) means that parameter sharding only occurs within groups of 8 GPUs, reducing cross-node communication:
With zpg=8 on 2 nodes of 8 GPUs each:
- GPUs 0-7 form one partition group
- GPUs 8-15 form another partition group
- AllGather/ReduceScatter only within groups (fast intra-node)
- AllReduce across groups for gradient synchronization
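The group layout above follows from a simple partition of the ranks; a minimal sketch (`partition_groups` is a hypothetical helper, not a DeepSpeed API):

```python
def partition_groups(world_size: int, zpg: int):
    """Split ranks into ZeRO partition groups of size zpg.
    Parameter AllGather/ReduceScatter stays within each group."""
    assert world_size % zpg == 0, "world size must be divisible by zpg"
    return [list(range(start, start + zpg))
            for start in range(0, world_size, zpg)]

# 2 nodes x 8 GPUs with zpg=8: one group per node.
groups = partition_groups(16, 8)
print(groups)  # [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]]
```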
Ray + DeepSpeed Integration
The integration requires careful initialization ordering:
1. Ray spawns N actor processes (one per learner GPU)
2. Each actor initializes torch.distributed (NCCL backend)
3. Each actor calls deepspeed.init_distributed()
4. Each actor creates DeepSpeed engine with the policy model
5. Rank 0 actor creates an additional process group for vLLM weight sync
6. All actors synchronize via barrier
The pre-initialization of torch.distributed before DeepSpeed is critical to avoid NCCL hangs when multiple process groups exist.
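The ordering constraint can be made concrete with stub functions that record when each step runs. All names here are illustrative stand-ins for the real torch.distributed/DeepSpeed calls, which require GPUs and a launcher to actually execute:

```python
call_log = []

def init_torch_distributed(rank):      # stand-in for torch.distributed.init_process_group("nccl")
    call_log.append(("torch_dist", rank))

def init_deepspeed_distributed(rank):  # stand-in for deepspeed.init_distributed()
    # Must run AFTER torch.distributed is up, or NCCL can hang
    # once additional process groups (e.g. vLLM weight sync) exist.
    assert ("torch_dist", rank) in call_log, "torch.distributed not initialized"
    call_log.append(("deepspeed", rank))

def create_vllm_sync_group(rank):      # rank 0 only: extra group for weight broadcast
    call_log.append(("vllm_group", rank))

world_size = 2
for rank in range(world_size):         # each Ray actor runs this in its own process
    init_torch_distributed(rank)
    init_deepspeed_distributed(rank)
    if rank == 0:
        create_vllm_sync_group(rank)
# all actors would then synchronize on a barrier before training begins
print(call_log[0])  # ('torch_dist', 0)
```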