Principle:Volcengine Verl GAE Advantage Estimation

Knowledge Sources	High-Dimensional Continuous Control Using Generalized Advantage Estimation; Proximal Policy Optimization Algorithms
Domains	Reinforcement_Learning, Policy_Optimization, Advantage_Estimation
Last Updated	2026-02-07 14:00 GMT

Overview

A token-level advantage estimation method that combines a learned value function with temporal difference residuals to compute advantages with a controllable bias-variance tradeoff.

Description

Generalized Advantage Estimation (GAE) computes advantages by combining multi-step temporal difference (TD) residuals through exponential weighting. Unlike GRPO, which uses outcome-level (sequence-level) advantages, GAE computes token-level advantages from the predictions of a learned critic (value function).

GAE addresses the fundamental bias-variance tradeoff in policy gradient methods:

High $λ$ (close to 1.0) produces low-bias, high-variance estimates (approaching Monte Carlo)
Low $λ$ (close to 0.0) produces high-bias, low-variance estimates (approaching one-step TD)
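The two extremes can be checked numerically. The following is a minimal sketch, not verl's implementation: the function name `gae`, the toy rewards, and the toy values are all hypothetical, and the value after the last token is assumed to be 0.

```python
def gae(rewards, values, gamma=1.0, lam=1.0):
    """Backward GAE recursion for one sequence (value after the last step is 0)."""
    advantages = [0.0] * len(rewards)
    last_gae, next_value = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
        next_value = values[t]
    return advantages

# Hypothetical 4-token response: reward only at the final token.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.5, 0.25, 0.75, 0.5]

# lam = 0: each advantage is just the one-step TD residual.
td_only = gae(rewards, values, gamma=1.0, lam=0.0)
# lam = 1 (with gamma = 1): each advantage is (sum of future rewards) - V(s_t).
monte_carlo = gae(rewards, values, gamma=1.0, lam=1.0)

print(td_only)      # [-0.25, 0.5, -0.25, 0.5]
print(monte_carlo)  # [0.5, 0.75, 0.25, 0.5]
```

With $λ = 0$ the estimate depends entirely on the critic's adjacent predictions, so critic error shows up as bias. With $λ = 1$ the critic only acts as a baseline, and the estimate inherits the variance of the sampled outcome.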

In the context of RLHF/PPO for language models, GAE is used with an actor-critic architecture where the critic predicts per-token values and the actor is updated using the GAE advantages.

Usage

Use GAE advantage estimation when:

A learned reward model provides dense or nuanced reward signals
An actor-critic architecture is desired (with a separate value function)
Token-level credit assignment is important (e.g., long responses where specific tokens matter)
The standard PPO algorithm with the full RLHF pipeline is being used

GAE is selected in verl by setting algorithm.adv_estimator=gae.

Theoretical Basis

GAE computes each advantage as an exponentially weighted sum of TD residuals (equivalently, by the backward recursion $A_{t} = δ_{t} + γ λ A_{t + 1}$):

$δ_{t} = r_{t} + γ V (s_{t + 1}) - V (s_{t})$

$A_{t}^{\mathrm{GAE}(γ, λ)} = \sum_{l = 0}^{\infty} (γ λ)^{l} δ_{t + l}$

Where:

$δ_{t}$ is the temporal difference residual at token $t$
$V (s_{t})$ is the critic's value prediction at token $t$
$γ$ is the discount factor (typically 1.0 for language tasks)
$λ$ is the GAE lambda controlling bias-variance tradeoff (typically 1.0)
$r_{t}$ is the token-level reward (usually 0 except at the final token)
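As a sanity check on the definitions above, the weighted sum can be evaluated directly and compared with the backward recursion $A_{t} = δ_{t} + γ λ A_{t + 1}$. This is an illustrative sketch only: the helper names and the toy numbers are hypothetical, and the infinite sum is truncated at the end of the sequence on the assumption that the value after the last token is 0.

```python
def td_residuals(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V = 0 past the last token."""
    T = len(rewards)
    return [
        rewards[t] + gamma * (values[t + 1] if t + 1 < T else 0.0) - values[t]
        for t in range(T)
    ]

def gae_direct(deltas, gamma, lam):
    """Sum over l of (gamma * lam)^l * delta_{t+l}, truncated at the sequence end."""
    T = len(deltas)
    return [
        sum((gamma * lam) ** l * deltas[t + l] for l in range(T - t))
        for t in range(T)
    ]

def gae_backward(deltas, gamma, lam):
    """Same quantity via A_t = delta_t + gamma * lam * A_{t+1}."""
    advantages, running = [0.0] * len(deltas), 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical 3-token response with a terminal reward.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.75]

deltas = td_residuals(rewards, values, gamma=1.0)
direct = gae_direct(deltas, gamma=1.0, lam=0.5)
backward = gae_backward(deltas, gamma=1.0, lam=0.5)

print(deltas)    # [0.0, 0.25, 0.25]
print(direct)    # [0.1875, 0.375, 0.25]
print(backward)  # [0.1875, 0.375, 0.25]
```

The direct sum costs quadratic time in the sequence length, while the backward recursion visits each token once, which is why implementations typically iterate from the last token to the first.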

The returns (targets for the critic) are computed as:

$G_{t} = A_{t}^{\mathrm{GAE}} + V (s_{t})$
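To see what these targets reduce to in a common special case, set $λ = 1$ and assume (an assumption of this sketch, not stated in the source) that the value after the final token $T$ is zero. The value terms in the sum of TD residuals then telescope:

$A_{t}^{\mathrm{GAE}(γ, 1)} = \sum_{l = 0}^{T - t} γ^{l} δ_{t + l} = \sum_{l = 0}^{T - t} γ^{l} r_{t + l} - V (s_{t})$

$G_{t} = A_{t}^{\mathrm{GAE}(γ, 1)} + V (s_{t}) = \sum_{l = 0}^{T - t} γ^{l} r_{t + l}$

Under these assumptions the critic target is the discounted reward-to-go. With $γ = 1$ and a single reward at the final token, every token of the response shares the same target: that final reward.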

Pseudo-code:

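# Abstract GAE computation (backward pass over one response)
# rewards, values, mask: 1-D float arrays of length seq_length.
# mask[t] is assumed to be 1 if token t + 1 continues the response, 0 at the final token.
import numpy as np

advantages = np.zeros_like(rewards)
last_gae = 0.0
for t in reversed(range(seq_length)):
    # No bootstrap past the last token: avoids reading values[seq_length].
    next_value = values[t + 1] if t + 1 < seq_length else 0.0
    delta = rewards[t] + gamma * next_value * mask[t] - values[t]
    advantages[t] = delta + gamma * lam * mask[t] * last_gae
    last_gae = advantages[t]
returns = advantages + values  # regression targets for the critic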

Related Pages

Implemented By

Implementation:Volcengine_Verl_Compute_GAE_Advantage_Return
