
Principle:Alibaba ROLL Policy Gradient Optimization

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Distributed_Training
Last Updated 2026-02-07 20:00 GMT

Overview

A policy optimization principle that updates LLM parameters using clipped policy gradient objectives with distributed model-parallel training.

Description

Policy Gradient Optimization is the core training step in RLVR (reinforcement learning with verifiable rewards), where the policy model's parameters are updated using advantages computed from reward signals. The optimization uses PPO's clipped surrogate objective to prevent destructively large policy updates. The training step is distributed across multiple GPUs using Megatron-Core's tensor parallelism (TP), pipeline parallelism (PP), and context parallelism (CP).

This principle bridges the gap between the RL algorithm (PPO/GRPO) and the distributed training infrastructure (Megatron-Core or DeepSpeed), handling:

  • Gradient computation through forward-backward passes with the PPO loss function
  • Gradient accumulation across micro-batches for effective large batch sizes
  • Model parallelism for training models too large for single GPUs
  • Memory offloading for optimizer states during inference-heavy phases
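The clipped surrogate objective driving the gradient computation above can be sketched as follows. This is a minimal NumPy illustration of the loss itself (the function name and signature are illustrative, not ROLL's actual API); a real training step would compute it on token log-probabilities inside the distributed forward-backward pass.

```python
import numpy as np

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (negated for minimization).

    logp_new / logp_old: log-probs of the taken actions under the
    current and behavior (rollout) policies; advantages: A-hat_t.
    """
    # Probability ratio r_t(theta), computed in log space for stability.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - eps, 1 + eps] bounds the update size.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic elementwise minimum, averaged over the batch.
    return -np.mean(np.minimum(unclipped, clipped))
```

When the new and old policies agree, the ratio is 1 and the loss reduces to the negated mean advantage; when the ratio drifts past the clip range, the gradient through the ratio is cut off, which is what prevents destructively large updates.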

Usage

Use this principle during the policy update step of RL training. The choice of backend (Megatron-Core vs DeepSpeed) depends on the model parallelism strategy and scale requirements.
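As an illustrative sketch of the knobs involved in that choice, the training-step configuration might look like the following. All key names here are hypothetical placeholders, not ROLL's actual config schema; they only show which dimensions (backend, parallel sizes, accumulation, offloading) the principle exposes.

```python
# Hypothetical training-step config; keys are illustrative, not ROLL's schema.
train_config = {
    "backend": "megatron-core",          # or "deepspeed" for ZeRO-style data parallelism
    "tensor_parallel_size": 4,           # TP degree within each layer
    "pipeline_parallel_size": 2,         # PP degree across layer stages
    "context_parallel_size": 1,          # CP degree along the sequence dimension
    "micro_batch_size": 1,
    "gradient_accumulation_steps": 16,   # effective batch = micro * accum * dp
    "offload_optimizer": True,           # free GPU memory during rollout-heavy phases
}
```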

Theoretical Basis

PPO Clipped Objective

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

Where: $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$

Distributed Training

The forward-backward pass is parallelized using:

  • Tensor Parallelism: Splits individual weight matrices within each layer across GPUs
  • Pipeline Parallelism: Splits model stages across GPUs with micro-batch pipelining
  • Context Parallelism: Splits the sequence dimension across GPUs for long contexts
  • Data Parallelism: Replicates the model across GPU groups with gradient all-reduce
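The parallel dimensions above multiply together to fill the cluster: once TP, PP, and CP degrees are fixed, the data-parallel degree is whatever remains. A minimal sketch of that arithmetic (the function is illustrative, not a Megatron-Core API):

```python
def parallel_layout(world_size, tp, pp, cp=1):
    """Derive the data-parallel degree from the other parallel dimensions.

    world_size must be divisible by tp * pp * cp; the remaining factor
    is the number of model replicas participating in gradient all-reduce.
    """
    model_parallel = tp * pp * cp
    assert world_size % model_parallel == 0, "world size must divide evenly"
    return {"tp": tp, "pp": pp, "cp": cp, "dp": world_size // model_parallel}
```

For example, 64 GPUs with TP=8 and PP=2 leave 4 data-parallel replicas, each holding one model shard arrangement.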

Related Pages

Implemented By

Related Heuristics

The following heuristics inform this principle:
