Implementation: Farama Foundation Gymnasium GAE Computation
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Policy_Gradient |
| Last Updated | 2026-02-15 03:00 GMT |
Overview
User-defined computation pattern for Generalized Advantage Estimation used with Gymnasium vectorized environments.
Description
GAE computation is a pattern doc entry: it is not a built-in Gymnasium function, but a standard computation pattern that users implement when building policy-gradient algorithms on top of Gymnasium's vectorized environment interface. The pattern uses the rewards, value estimates, and done signals collected from VectorEnv.step() to compute advantages via a backward recursion over the rollout.
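The backward recursion can be written out explicitly. In the standard notation of the GAE formulation (Schulman et al.), with rollout length T and the convention that the recursion starts from zero at the end of the rollout:

```latex
\delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t)
\qquad
\hat{A}_t = \delta_t + \gamma \lambda\, \hat{A}_{t+1},
\quad \hat{A}_T = 0
```

At episode boundaries the done flag zeroes both the bootstrap term \(\gamma V(s_{t+1})\) and the carried-over \(\gamma\lambda \hat{A}_{t+1}\), so advantages do not leak across episodes.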
Usage
Implement this pattern after collecting a rollout of T steps from N vectorized environments. Requires a value function (typically a neural network) to estimate state values for bootstrapping.
Code Reference
Source Location
- Repository: User-implemented pattern (not in Gymnasium source)
- Reference: Gymnasium tutorials use this pattern with vectorized environments
Signature
```python
def compute_gae(
    rewards: np.ndarray,       # (T, N) rewards from envs.step()
    values: np.ndarray,        # (T+1, N) value estimates from critic
    dones: np.ndarray,         # (T, N) episode done flags
    gamma: float = 0.99,       # Discount factor
    gae_lambda: float = 0.95,  # GAE lambda parameter
) -> np.ndarray:
    """Compute GAE advantages from collected rollout data.

    Args:
        rewards: Per-step rewards, shape (T, N).
        values: Value estimates, shape (T+1, N), including the bootstrap value.
        dones: Done flags, shape (T, N).
        gamma: Discount factor.
        gae_lambda: GAE lambda for bias-variance tradeoff.

    Returns:
        advantages: GAE advantages, shape (T, N).
    """
```
Import
# User-defined function, no library import needed
import numpy as np
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| rewards | np.ndarray (T, N) | Yes | Rewards collected from VectorEnv.step() |
| values | np.ndarray (T+1, N) | Yes | Value estimates from critic network |
| dones | np.ndarray (T, N) | Yes | Episode completion flags |
| gamma | float | No | Discount factor (default 0.99) |
| gae_lambda | float | No | GAE lambda (default 0.95) |
Outputs
| Name | Type | Description |
|---|---|---|
| advantages | np.ndarray (T, N) | GAE advantage estimates per step per env |
Usage Examples
GAE Implementation
```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    T, N = rewards.shape
    advantages = np.zeros((T, N))
    last_gae = np.zeros(N)
    for t in reversed(range(T)):
        # Mask out the bootstrap value when the episode ended at step t.
        next_non_terminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * next_non_terminal - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[t] = last_gae
    return advantages
```
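A useful sanity check for any implementation of this recursion: with gae_lambda=1 and no episode boundaries, GAE reduces to the discounted return-to-go (bootstrapped from the final value estimate) minus the value baseline, because the TD residuals telescope. A hedged test sketch on random data:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    T, N = rewards.shape
    advantages = np.zeros((T, N))
    last_gae = np.zeros(N)
    for t in reversed(range(T)):
        nnt = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nnt - values[t]
        last_gae = delta + gamma * gae_lambda * nnt * last_gae
        advantages[t] = last_gae
    return advantages

rng = np.random.default_rng(0)
T, N, gamma = 8, 3, 0.99
rewards = rng.normal(size=(T, N))
values = rng.normal(size=(T + 1, N))
dones = np.zeros((T, N))  # no episode boundaries in this check

adv = compute_gae(rewards, values, dones, gamma=gamma, gae_lambda=1.0)

# With lambda = 1: A_t = sum_l gamma^l * r_{t+l} + gamma^(T-t) * V_T - V_t
expected = np.zeros((T, N))
ret = values[T].copy()  # bootstrap from the final value estimate
for t in reversed(range(T)):
    ret = rewards[t] + gamma * ret
    expected[t] = ret - values[t]

assert np.allclose(adv, expected)
```

If this identity fails, the usual culprit is an off-by-one in which done flag masks which value estimate.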
```python
# Usage with vectorized environments
import gymnasium as gym

envs = gym.make_vec("CartPole-v1", num_envs=8)
obs, _ = envs.reset(seed=42)

T = 128  # rollout length
all_rewards = np.zeros((T, envs.num_envs))
all_dones = np.zeros((T, envs.num_envs))
all_values = np.zeros((T + 1, envs.num_envs))

for t in range(T):
    # all_values[t] = critic(obs)  # Value estimate for the current observations
    actions = envs.action_space.sample()
    obs, rewards, terms, truncs, infos = envs.step(actions)
    all_rewards[t] = rewards
    # Treating truncation like termination is a common simplification; it
    # slightly biases GAE, since a truncated episode should ideally bootstrap
    # from the value of its final state rather than zero.
    all_dones[t] = np.logical_or(terms, truncs)
# all_values[T] = critic(obs)  # Bootstrap value for the final observations

advantages = compute_gae(all_rewards, all_values, all_dones)
envs.close()
```
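In a typical policy-gradient training loop, the advantages produced by this pattern are then turned into value-function targets and normalized before the policy update. A minimal sketch of that post-processing, assuming the (T, N) advantages and (T+1, N) values arrays above (the 1e-8 stabilizer is a common convention, not part of the pattern itself):

```python
import numpy as np

T, N = 128, 8
advantages = np.random.default_rng(1).normal(size=(T, N))
all_values = np.random.default_rng(2).normal(size=(T + 1, N))

# Value-function targets: returns = advantages + values (drop the bootstrap row)
returns = advantages + all_values[:-1]

# Normalize advantages over the whole batch (standard PPO practice)
flat = advantages.reshape(-1)
norm_adv = (flat - flat.mean()) / (flat.std() + 1e-8)

assert returns.shape == (T, N)
```

Normalizing per batch keeps the policy-gradient scale roughly constant across rollouts, which makes the learning rate easier to tune.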