Principle:Volcengine Verl GAE Advantage Estimation

From Leeroopedia


Knowledge Sources
Domains: Reinforcement_Learning, Policy_Optimization, Advantage_Estimation
Last Updated: 2026-02-07 14:00 GMT

Overview

A token-level advantage estimation method that uses a learned value function and temporal difference residuals to compute advantages with a controllable bias-variance tradeoff.

Description

Generalized Advantage Estimation (GAE) computes advantages by combining multi-step temporal difference (TD) residuals with exponentially decaying weights. Unlike GRPO, which uses outcome-level (sequence-level) advantages, GAE computes token-level advantages from the predictions of a learned critic (value function).

GAE addresses the fundamental bias-variance tradeoff in policy gradient methods:

  • High λ (close to 1.0) produces low-bias, high-variance estimates (approaching Monte Carlo)
  • Low λ (close to 0.0) produces high-bias, low-variance estimates (approaching one-step TD)
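
The two extremes can be checked numerically. The following is a minimal sketch, not verl's implementation: the function name `gae_advantages` and the toy rewards and values are hypothetical, and the value after the final token is assumed to be 0.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Backward GAE recursion for one sequence (value after the last token assumed 0)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = np.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD residual
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    return advantages

# Hypothetical toy data: sparse reward on the final token only.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.4, 0.9]

adv_td = gae_advantages(rewards, values, gamma=1.0, lam=0.0)  # one-step TD residuals
adv_mc = gae_advantages(rewards, values, gamma=1.0, lam=1.0)  # reward-to-go minus V(s_t)

print(adv_td)
print(adv_mc)
```

With λ = 0 each advantage depends only on two adjacent critic predictions, so critic errors feed directly into the estimate. With λ = 1 and γ = 1 the critic enters only as the baseline V(s_t), and the rest of the estimate is the observed reward-to-go.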

In the context of RLHF/PPO for language models, GAE is used with an actor-critic architecture where the critic predicts per-token values and the actor is updated using the GAE advantages.

Usage

Use GAE advantage estimation when:

  • A learned reward model provides dense or nuanced reward signals
  • An actor-critic architecture is desired (with a separate value function)
  • Token-level credit assignment is important (e.g., long responses where specific tokens matter)
  • The standard PPO algorithm with the full RLHF pipeline is being used

GAE is selected in verl by setting algorithm.adv_estimator=gae.

Theoretical Basis

GAE defines the advantage as an exponentially weighted sum of TD residuals:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$A_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$$

Where:

  • $\delta_t$ is the temporal difference residual at token $t$
  • $V(s_t)$ is the critic's value prediction at token $t$
  • $\gamma$ is the discount factor (typically 1.0 for language tasks)
  • $\lambda$ is the GAE lambda controlling the bias-variance tradeoff (typically 1.0)
  • $r_t$ is the token-level reward (usually 0 except at the final token)
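
In practice the weighted sum is not evaluated term by term. Splitting off the first term gives an equivalent backward recursion, sketched here under the assumption that every residual beyond the final token $T$ is zero:

```latex
\begin{aligned}
A_t^{\mathrm{GAE}(\gamma,\lambda)}
  &= \delta_t + \sum_{l=1}^{\infty} (\gamma\lambda)^{l} \, \delta_{t+l} \\
  &= \delta_t + \gamma\lambda \sum_{l=0}^{\infty} (\gamma\lambda)^{l} \, \delta_{t+1+l} \\
  &= \delta_t + \gamma\lambda \, A_{t+1}^{\mathrm{GAE}(\gamma,\lambda)},
  \qquad A_{T+1}^{\mathrm{GAE}(\gamma,\lambda)} = 0 .
\end{aligned}
```

Starting from the last token and moving backward, each advantage therefore needs only the current residual and the advantage already computed for the following token.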

The returns (targets for the critic) are computed as:

$$G_t = A_t^{\mathrm{GAE}(\gamma,\lambda)} + V(s_t)$$

Pseudo-code:

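# Abstract GAE computation (backward pass over one sequence)
import numpy as np

def compute_gae(rewards, values, mask, gamma=1.0, lam=1.0):
    # mask[t] = 0 stops bootstrapping and accumulation past token t, else 1
    seq_length = len(rewards)
    advantages = np.zeros(seq_length)
    last_gae = 0.0
    for t in reversed(range(seq_length)):
        # No value exists after the final token, so bootstrap from 0 there
        next_value = values[t + 1] if t + 1 < seq_length else 0.0
        delta = rewards[t] + gamma * next_value * mask[t] - values[t]
        advantages[t] = delta + gamma * lam * mask[t] * last_gae
        last_gae = advantages[t]
    returns = advantages + np.asarray(values)
    return advantages, returns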
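
The backward pass above handles one sequence. Training batches usually contain responses of different lengths, so a batched variant is sketched below. It is an illustration under stated assumptions rather than verl's actual code: the name `gae_batch`, the right-padded layout, and the `response_mask` convention (1 for real response tokens, 0 for padding) are all hypothetical choices.

```python
import numpy as np

def gae_batch(rewards, values, response_mask, gamma=1.0, lam=1.0):
    """GAE for a right-padded batch of shape (batch, seq_len).

    response_mask is 1 for real response tokens and 0 for padding (assumed convention).
    """
    response_mask = np.asarray(response_mask, dtype=float)
    # Zero out rewards and values on padding so they cannot leak into real tokens.
    rewards = np.asarray(rewards, dtype=float) * response_mask
    values = np.asarray(values, dtype=float) * response_mask
    batch, seq_len = rewards.shape

    advantages = np.zeros_like(rewards)
    last_gae = np.zeros(batch)
    next_value = np.zeros(batch)  # value after the final position is taken as 0
    for t in reversed(range(seq_len)):
        delta = rewards[:, t] + gamma * next_value - values[:, t]
        # Multiplying by the mask keeps advantages at exactly 0 on padding.
        last_gae = (delta + gamma * lam * last_gae) * response_mask[:, t]
        advantages[:, t] = last_gae
        next_value = values[:, t]
    returns = advantages + values
    return advantages, returns

# Hypothetical batch: one response of length 4 and one of length 2 (right-padded).
rewards = np.array([[0.0, 0.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0, 0.0]])
values = np.array([[0.5, 0.6, 0.4, 0.9],
                   [0.2, 0.7, 0.3, 0.3]])
response_mask = np.array([[1, 1, 1, 1],
                          [1, 1, 0, 0]])

adv, ret = gae_batch(rewards, values, response_mask, gamma=1.0, lam=1.0)
print(adv)
print(ret)
```

Because the padded positions contribute zero reward, zero value, and zero advantage, the last real token of each row bootstraps from 0, matching the single-sequence case.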

Related Pages

Implemented By
