# Principle:Alibaba ROLL Policy Gradient Optimization
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Distributed_Training |
| Last Updated | 2026-02-07 20:00 GMT |
## Overview
A policy optimization principle that updates LLM parameters using clipped policy gradient objectives with distributed model-parallel training.
## Description
Policy Gradient Optimization is the core training step in RLVR (reinforcement learning with verifiable rewards), where the policy model's parameters are updated using advantages computed from reward signals. The optimization uses PPO's clipped surrogate objective to prevent destructively large policy updates. The training step is distributed across multiple GPUs using Megatron-Core's tensor parallelism (TP), pipeline parallelism (PP), and context parallelism (CP).
This principle bridges the gap between the RL algorithm (PPO/GRPO) and the distributed training infrastructure (Megatron-Core or DeepSpeed), handling:
- Gradient computation through forward-backward passes with the PPO loss function
- Gradient accumulation across micro-batches for effective large batch sizes
- Model parallelism for training models too large for single GPUs
- Memory offloading for optimizer states during inference-heavy phases
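Gradient accumulation deserves a concrete check: averaging per-micro-batch gradients reproduces the full-batch gradient exactly (for a mean-reduced loss over equal-sized micro-batches). The sketch below is framework-agnostic NumPy on a toy linear model, not ROLL's actual training loop:

```python
import numpy as np

def grad_mse(w, X, y):
    """Gradient of 0.5 * mean squared error for a linear model y_hat = X @ w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

# Full-batch gradient in one pass.
g_full = grad_mse(w, X, y)

# Gradient accumulation: average the gradients of 4 equal-sized micro-batches.
micro = 4
g_acc = np.zeros_like(w)
for Xb, yb in zip(np.split(X, micro), np.split(y, micro)):
    g_acc += grad_mse(w, Xb, yb)
g_acc /= micro

print(np.allclose(g_full, g_acc))  # True
```

This is why accumulating over micro-batches yields an "effective" large batch: the parameter update is mathematically identical to one large-batch step, at a fraction of the peak activation memory.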
## Usage
Use this principle during the policy update step of RL training. The choice of backend (Megatron-Core vs DeepSpeed) depends on the model parallelism strategy and scale requirements.
## Theoretical Basis
### PPO Clipped Objective

$$
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
$$

Where:
- $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio between the current policy and the old (behavior) policy
- $\hat{A}_t$ is the estimated advantage at timestep $t$
- $\epsilon$ is the clipping range bounding how far the updated policy may move from the old policy in a single step
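The clipped objective is a few lines of NumPy. The sketch below is an illustrative, framework-agnostic loss function (negated, so it is minimized), not ROLL's backend implementation:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negated PPO clipped surrogate objective over a batch of tokens.

    logp_new / logp_old: per-token log-probs under the current / old policy.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The elementwise min keeps the pessimistic surrogate, removing any
    # incentive to push the ratio outside [1 - eps, 1 + eps].
    return -np.mean(np.minimum(unclipped, clipped))

# Usage: the first token's ratio (1.8) exceeds 1 + eps and gets clipped to 1.2.
adv = np.array([1.0, 1.0])
lp_old = np.log(np.array([0.5, 0.5]))
lp_new = np.log(np.array([0.9, 0.5]))
print(ppo_clip_loss(lp_new, lp_old, adv))  # ~ -(1.2 + 1.0) / 2 = -1.1
```

With positive advantages, gradients vanish once the ratio is clipped above $1+\epsilon$, which is exactly the mechanism that prevents destructively large policy updates.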
### Distributed Training
The forward-backward pass is parallelized using:
- Tensor Parallelism: Splits individual weight matrices within each layer across GPUs
- Pipeline Parallelism: Splits model stages across GPUs with micro-batch pipelining
- Data Parallelism: Replicates model across GPU groups with gradient all-reduce
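The arithmetic behind tensor parallelism can be shown with a column-parallel linear layer: each rank holds a slice of the weight matrix's output columns, computes a partial result, and the full output is recovered by concatenation. A minimal NumPy sketch (two simulated ranks, no real communication):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))   # activations: (batch, hidden)
W = rng.normal(size=(6, 8))   # one layer's weight matrix

# Tensor parallelism (column-parallel): each of 2 "ranks" owns half of W's
# output columns and computes its partial output independently.
shards = np.split(W, 2, axis=1)
partials = [x @ Wi for Wi in shards]

# An all-gather along the feature dimension reassembles the full output.
y_tp = np.concatenate(partials, axis=1)

print(np.allclose(y_tp, x @ W))  # True
```

In Megatron-style TP, this column-parallel layer is typically paired with a row-parallel one so that the gather can be deferred, but the core identity is the same block-matrix factorization shown here.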
## Related Pages
### Implemented By
### Related Heuristics
The following heuristics inform this principle: