
Principle:Farama Foundation Gymnasium Generalized Advantage Estimation

From Leeroopedia
Knowledge Sources
Domains Reinforcement_Learning, Policy_Gradient
Last Updated 2026-02-15 03:00 GMT

Overview

A variance reduction technique for policy gradient methods that computes advantage estimates using an exponentially weighted sum of temporal-difference residuals.

Description

Generalized Advantage Estimation (GAE) addresses the bias-variance tradeoff in advantage estimation for policy gradient algorithms. Raw Monte Carlo returns have high variance, while single-step TD errors have high bias. GAE interpolates between these extremes using a parameter λ ∈ [0, 1]:

  • λ=0: single-step TD residual (low variance, high bias)
  • λ=1: Monte Carlo returns (high variance, low bias)

GAE is the standard advantage estimator in PPO, A2C, and other actor-critic methods. It provides smooth control over the bias-variance tradeoff, typically with λ=0.95 as a widely-used default.

Usage

Use GAE when implementing actor-critic algorithms that require advantage estimates. It is computed after collecting a batch of trajectories from vectorized environments and requires a learned value function for bootstrapping.

Theoretical Basis

The TD residual at time t: δ_t = r_t + γ·V(s_{t+1}) − V(s_t)

The GAE advantage estimate: Â_t^{GAE(γ,λ)} = Σ_{l=0}^{T−t−1} (γλ)^l · δ_{t+l}

This can be computed efficiently via backward recursion: Â_T = 0, then Â_t = δ_t + γλ·Â_{t+1} for t = T−1, …, 0.
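The backward recursion above can be sketched as follows. This is a minimal single-trajectory implementation, not taken from any particular library; the function name `compute_gae` and its signature are illustrative.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE advantages for one uninterrupted trajectory, via backward recursion.

    rewards:    (T,) array of rewards r_t
    values:     (T,) array of value estimates V(s_t)
    last_value: bootstrap value V(s_T) for the state after the final step
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0                 # corresponds to A-hat_T = 0
    next_value = last_value
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Recursion: A-hat_t = delta_t + gamma * lambda * A-hat_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

Setting `lam=0.0` recovers pure one-step TD advantages, and `lam=1.0` with `gamma=1.0` recovers undiscounted Monte Carlo returns minus the baseline, matching the two extremes listed earlier.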

At episode boundaries (terminated=True), V(s_{t+1}) = 0. At truncation (truncated=True), V(s_{t+1}) is still used for bootstrapping.
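The terminated/truncated distinction can be folded into the recursion with two masks: termination zeroes the bootstrap value, while both kinds of episode end stop the advantage carry-over. This is a hedged sketch assuming Gymnasium's flag semantics; the function name `compute_gae_masked` and the per-step `next_values` layout are assumptions for illustration.

```python
import numpy as np

def compute_gae_masked(rewards, values, next_values, terminated, truncated,
                       gamma=0.99, lam=0.95):
    """GAE over a rollout that may contain episode boundaries.

    terminated[t]: True -> s_{t+1} has no value, so V(s_{t+1}) = 0 in delta_t
    truncated[t]:  True -> V(s_{t+1}) is kept for bootstrapping,
                   but the recursion still resets across the boundary
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # Termination zeroes the bootstrap; truncation keeps it.
        bootstrap = 0.0 if terminated[t] else next_values[t]
        delta = rewards[t] + gamma * bootstrap - values[t]
        # Either kind of episode end stops the (gamma*lambda) accumulation.
        carry = 0.0 if (terminated[t] or truncated[t]) else gae
        gae = delta + gamma * lam * carry
        advantages[t] = gae
    return advantages
```

The key design point is that the two flags are not interchangeable: conflating them either leaks value across resets or discards the bootstrap at time limits, both of which bias the advantage estimates.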

Related Pages

Implemented By
