
Principle:Alibaba ROLL Advantage Estimation with KL Penalty

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Optimization
Last Updated 2026-02-07 20:00 GMT

Overview

An advantage estimation principle that combines multiple estimation algorithms (GAE, GRPO, Reinforce++) with token-level KL divergence penalties to produce stable policy gradient signals.

Description

Advantage estimation transforms raw reward signals into normalized, variance-reduced training signals for policy gradient methods. This principle covers three interconnected operations:

  1. Reward post-processing: Normalizing and clipping response-level rewards using running statistics
  2. Token-level KL penalty: Adding a per-token KL divergence penalty between the current policy and a reference policy to prevent reward hacking
  3. Advantage computation: Computing per-token advantages using one of several estimators (GAE, GRPO, Reinforce++)
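The first of these operations, reward post-processing with running statistics, can be sketched as follows. This is an illustrative implementation, not ROLL's actual code; the class name and the clip range of 5.0 are assumptions for the example.

```python
import numpy as np

class RunningRewardNormalizer:
    """Normalize and clip response-level rewards with running statistics (sketch)."""

    def __init__(self, clip_range=5.0, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0          # sum of squared deviations (Welford's algorithm)
        self.clip_range = clip_range
        self.eps = eps

    def update(self, rewards):
        # Welford's online update over a batch of response-level rewards
        for r in rewards:
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    def normalize(self, rewards):
        # Z-score against running stats, then clip to bound outlier rewards
        std = (self.m2 / max(self.count - 1, 1)) ** 0.5
        z = (np.asarray(rewards) - self.mean) / (std + self.eps)
        return np.clip(z, -self.clip_range, self.clip_range)
```

In practice the normalizer is updated once per batch and then applied to that batch's rewards before the KL penalty and advantage steps.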

The KL penalty is particularly important in RLHF/RLVR as it prevents the policy from diverging too far from the reference model, which would lead to degenerate outputs that exploit reward model weaknesses.

Usage

Use this principle after reward computation and before policy optimization. The choice of advantage estimator affects training dynamics:

  • GRPO: Group-relative, no value function needed, normalizes within a group of samples per prompt
  • Reinforce++: Critic-free; subtracts a running-mean baseline from the return
  • GAE: Generalized Advantage Estimation with value function, requires a critic network

Theoretical Basis

KL-Penalized Reward

$r_t^{\mathrm{KL}} = r_t - \beta \,\mathrm{KL}\!\left[\pi_\theta(a_t \mid s_t) \,\|\, \pi_{\mathrm{ref}}(a_t \mid s_t)\right]$

Where β is an adaptive coefficient updated based on a target KL budget.
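A minimal sketch of the penalized reward and the adaptive β update. The per-token KL here uses the simple log-ratio estimator (log π<sub>θ</sub> − log π<sub>ref</sub>); ROLL may use a different KL estimator, and the controller constants (clip of ±0.2, horizon of 10000) follow the PPO-style adaptive controller rather than ROLL's exact defaults.

```python
import numpy as np

def apply_kl_penalty(rewards, logprobs, ref_logprobs, beta):
    """Subtract a per-token KL estimate from per-token rewards.

    All inputs are arrays of shape (T,). The log-ratio estimator
    logpi - logpi_ref is one common per-token KL approximation.
    """
    kl = np.asarray(logprobs) - np.asarray(ref_logprobs)
    return np.asarray(rewards) - beta * kl

class AdaptiveKLController:
    """PPO-style adaptive beta: move toward a target KL budget (sketch)."""

    def __init__(self, init_beta=0.1, target_kl=6.0, horizon=10000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        # Proportional error against the target KL, clipped for stability
        err = np.clip(observed_kl / self.target_kl - 1.0, -0.2, 0.2)
        self.beta *= 1.0 + err * n_steps / self.horizon
```

When the observed KL exceeds the target budget, β grows, pulling the policy back toward the reference model; when it is under budget, β shrinks.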

Advantage Estimators

GAE: $\hat{A}_t^{\mathrm{GAE}} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
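The GAE recursion above is usually computed by a single backward pass over the trajectory, since the infinite sum telescopes into $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. An illustrative sketch (not ROLL's implementation):

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: shape (T,); values: shape (T+1,), including the
    bootstrap value V(s_T) for the state after the last step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Accumulate: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```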

GRPO (Group Relative): $\hat{A}_i = \dfrac{r_i - \mu_G}{\sigma_G}$

Where μG and σG are the mean and std of rewards within the group.
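Since GRPO normalizes within the group of responses sampled for one prompt, the computation reduces to a per-group z-score. A minimal sketch (the epsilon guard against zero variance is an assumption):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: z-score rewards within one prompt's group."""
    r = np.asarray(group_rewards, dtype=float)
    # (r_i - mu_G) / sigma_G; eps avoids division by zero when all rewards tie
    return (r - r.mean()) / (r.std() + eps)
```

Each token in response i then receives the same scalar advantage $\hat{A}_i$, which is what makes GRPO critic-free.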

Reinforce++: $\hat{A}_t = R_t - b(s_t)$

Where b(st) is the running mean baseline.
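A sketch of a running-mean baseline, as an illustration of the formula above. Here the baseline is a single global running mean rather than a per-state function; whether ROLL conditions the baseline on state is not specified by this page.

```python
class RunningBaseline:
    """Running-mean baseline for Reinforce++-style advantages (sketch)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def advantage(self, returns):
        advs = []
        for R in returns:
            # Subtract the baseline as of *before* seeing this return
            advs.append(R - self.mean)
            self.count += 1
            self.mean += (R - self.mean) / self.count
        return advs
```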
