# Principle:Alibaba ROLL Policy Gradient Optimization
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Distributed_Training |
| Last Updated | 2026-02-07 20:00 GMT |
## Overview
A policy optimization principle that updates LLM parameters using clipped policy gradient objectives with distributed model-parallel training.
## Description
Policy Gradient Optimization is the core training step in RLVR (reinforcement learning with verifiable rewards), where the policy model's parameters are updated using advantages computed from reward signals. The optimization uses PPO's clipped surrogate objective to prevent destructively large policy updates. The training step is distributed across multiple GPUs using Megatron-Core's tensor parallelism (TP), pipeline parallelism (PP), and context parallelism (CP).
This principle bridges the gap between the RL algorithm (PPO/GRPO) and the distributed training infrastructure (Megatron-Core or DeepSpeed), handling:
- Gradient computation through forward-backward passes with the PPO loss function
- Gradient accumulation across micro-batches for effective large batch sizes
- Model parallelism for training models too large for single GPUs
- Memory offloading for optimizer states during inference-heavy phases
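Gradient accumulation deserves a concrete check: averaging per-micro-batch gradients reproduces the full-batch gradient exactly (for a mean-reduced loss over equal-sized micro-batches). The sketch below is framework-agnostic NumPy on a toy linear model, not ROLL's actual training loop:

```python
import numpy as np

def grad_mse(w, X, y):
    """Gradient of 0.5 * mean squared error for a linear model y_hat = X @ w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

# Full-batch gradient in one pass.
g_full = grad_mse(w, X, y)

# Gradient accumulation: average the gradients of 4 equal-sized micro-batches.
micro = 4
g_acc = np.zeros_like(w)
for Xb, yb in zip(np.split(X, micro), np.split(y, micro)):
    g_acc += grad_mse(w, Xb, yb)
g_acc /= micro

print(np.allclose(g_full, g_acc))  # True
```

This is why accumulating over micro-batches yields an "effective" large batch: the parameter update is mathematically identical to one large-batch step, at a fraction of the peak activation memory.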
## Usage
Use this principle during the policy update step of RL training. The choice of backend (Megatron-Core vs DeepSpeed) depends on the model parallelism strategy and scale requirements.
## Theoretical Basis
### PPO Clipped Objective

$$
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
$$

Where:
- $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio between the current policy and the old (behavior) policy
- $\hat{A}_t$ is the estimated advantage at timestep $t$
- $\epsilon$ is the clipping range bounding how far the updated policy may move from the old policy in a single step
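The clipped objective is a few lines of NumPy. The sketch below is an illustrative, framework-agnostic loss function (negated, so it is minimized), not ROLL's backend implementation:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negated PPO clipped surrogate objective over a batch of tokens.

    logp_new / logp_old: per-token log-probs under the current / old policy.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The elementwise min keeps the pessimistic surrogate, removing any
    # incentive to push the ratio outside [1 - eps, 1 + eps].
    return -np.mean(np.minimum(unclipped, clipped))

# Usage: the first token's ratio (1.8) exceeds 1 + eps and gets clipped to 1.2.
adv = np.array([1.0, 1.0])
lp_old = np.log(np.array([0.5, 0.5]))
lp_new = np.log(np.array([0.9, 0.5]))
print(ppo_clip_loss(lp_new, lp_old, adv))  # ~ -(1.2 + 1.0) / 2 = -1.1
```

With positive advantages, gradients vanish once the ratio is clipped above $1+\epsilon$, which is exactly the mechanism that prevents destructively large policy updates.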
### Distributed Training
The forward-backward pass is parallelized using:
- Tensor Parallelism: Splits individual weight matrices within each layer across GPUs
- Pipeline Parallelism: Splits model stages across GPUs with micro-batch pipelining
- Data Parallelism: Replicates model across GPU groups with gradient all-reduce
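The arithmetic behind tensor parallelism can be shown with a column-parallel linear layer: each rank holds a slice of the weight matrix's output columns, computes a partial result, and the full output is recovered by concatenation. A minimal NumPy sketch (two simulated ranks, no real communication):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))   # activations: (batch, hidden)
W = rng.normal(size=(6, 8))   # one layer's weight matrix

# Tensor parallelism (column-parallel): each of 2 "ranks" owns half of W's
# output columns and computes its partial output independently.
shards = np.split(W, 2, axis=1)
partials = [x @ Wi for Wi in shards]

# An all-gather along the feature dimension reassembles the full output.
y_tp = np.concatenate(partials, axis=1)

print(np.allclose(y_tp, x @ W))  # True
```

In Megatron-style TP, this column-parallel layer is typically paired with a row-parallel one so that the gather can be deferred, but the core identity is the same block-matrix factorization shown here.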
## Related Pages
### Implemented By
### Related Heuristics
The following heuristics inform this principle: