Principle:Volcengine Verl GAE Advantage Estimation
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Policy_Optimization, Advantage_Estimation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
A token-level advantage estimation method that uses a learned value function and temporal difference residuals to compute advantages with a controllable bias-variance tradeoff.
Description
Generalized Advantage Estimation (GAE) computes advantages by combining multi-step temporal difference (TD) residuals with exponentially decaying weights. Unlike GRPO, which uses outcome-level (sequence-level) advantages, GAE computes token-level advantages using predictions from a learned critic (value function).
GAE addresses the fundamental bias-variance tradeoff in policy gradient methods:
- High $\lambda$ (close to 1.0) produces low-bias, high-variance estimates (approaching Monte Carlo)
- Low $\lambda$ (close to 0.0) produces high-bias, low-variance estimates (approaching one-step TD)
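The two limits can be made concrete with a short derivation. It is a sketch in the standard GAE notation, assuming $\lambda$ is the GAE parameter, $\gamma$ the discount factor, $r_t$ the token-level reward, $V(s_t)$ the critic's prediction, $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ the TD residual, and a value of 0 past the final token of a $T$-token response indexed $t = 0, \dots, T-1$. Unrolling the GAE recursion gives an exponentially weighted sum of residuals:

$$\hat{A}_t = \sum_{l=0}^{T-1-t} (\gamma\lambda)^l \, \delta_{t+l}$$

$$\lambda = 0: \quad \hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$\lambda = 1: \quad \hat{A}_t = \sum_{l=0}^{T-1-t} \gamma^l \, r_{t+l} - V(s_t)$$

At $\lambda = 0$ the estimate relies on the critic's one-step bootstrap, so critic errors show up directly as bias. At $\lambda = 1$ the intermediate value terms telescope away, leaving the sampled discounted return minus a baseline, which is unbiased but carries the full variance of the sampled rewards.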
In the context of RLHF/PPO for language models, GAE is used with an actor-critic architecture where the critic predicts per-token values and the actor is updated using the GAE advantages.
Usage
Use GAE advantage estimation when:
- A learned reward model provides dense or nuanced reward signals
- An actor-critic architecture is desired (with a separate value function)
- Token-level credit assignment is important (e.g., long responses where specific tokens matter)
- The standard PPO algorithm with a full RLHF pipeline is being used
GAE is selected in verl by setting algorithm.adv_estimator=gae.
Theoretical Basis
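GAE computes advantages using the recursive formula:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}$$

The recursion runs backward from the final token, with the advantage and the value past the end of the response taken to be 0.

Where:
- $\delta_t$ is the temporal difference residual at token $t$
- $V(s_t)$ is the critic's value prediction at token $t$
- $\gamma$ is the discount factor (typically 1.0 for language tasks)
- $\lambda$ is the GAE lambda controlling the bias-variance tradeoff (typically 1.0)
- $r_t$ is the token-level reward (usually 0 except at the final token)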
The returns (targets for the critic) are computed by adding the critic's value back to the GAE advantage $\hat{A}_t$:

$$R_t = \hat{A}_t + V(s_t)$$
Pseudo-code:
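    # Abstract GAE computation (backward pass over one response)
    # mask[t] is assumed to be 1 while the sequence continues past token t, else 0
    advantages = zeros_like(rewards)
    last_gae = 0
    for t in reversed(range(seq_length)):
        # No bootstrapping past the final token: values[t + 1] does not exist there
        next_value = values[t + 1] if t + 1 < seq_length else 0
        delta = rewards[t] + gamma * next_value * mask[t] - values[t]
        advantages[t] = delta + gamma * lam * mask[t] * last_gae
        last_gae = advantages[t]
    returns = advantages + values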
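The backward pass can also be written as a small runnable function. The following is a minimal NumPy sketch for a single unpadded response, not verl's implementation: the function name `compute_gae` and the example numbers are hypothetical, the value after the final token is assumed to be 0, and batching, padding masks, and advantage normalization are left out.

```python
import numpy as np


def compute_gae(rewards, values, gamma=1.0, lam=1.0):
    """Return (advantages, returns) for one response of T tokens.

    rewards, values: 1-D sequences of length T.
    Assumption: the value after the final token is 0 (no bootstrapping).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    advantages = np.zeros_like(rewards)
    last_gae = 0.0    # advantage of the following token, 0 past the end
    next_value = 0.0  # value of the following token, 0 past the end
    for t in reversed(range(len(rewards))):
        # TD residual at token t
        delta = rewards[t] + gamma * next_value - values[t]
        # GAE recursion: current residual plus decayed future advantage
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
        next_value = values[t]
    returns = advantages + values  # regression targets for the critic
    return advantages, returns


# Hypothetical example: reward of 1.0 only at the final token.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.5, 0.9]
adv, ret = compute_gae(rewards, values)            # gamma = lam = 1.0
adv_td, _ = compute_gae(rewards, values, lam=0.0)  # one-step TD limit
```

With `gamma = lam = 1.0` and a single final reward, the residuals telescope so that each advantage equals the final reward minus the critic's value at that token, and every return equals the final reward. With `lam = 0.0` each advantage is just its own TD residual.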