
Implementation:Farama Foundation Gymnasium GAE Computation

From Leeroopedia
Knowledge Sources
Domains Reinforcement_Learning, Policy_Gradient
Last Updated 2026-02-15 03:00 GMT

Overview

User-defined computation pattern for Generalized Advantage Estimation used with Gymnasium vectorized environments.

Description

GAE computation is a Pattern Doc — it is not a built-in Gymnasium function but a standard computation pattern that users implement when building policy gradient algorithms on top of Gymnasium's vectorized environment interface. The pattern uses rewards, value estimates, and done signals collected from VectorEnv.step() to compute advantages via backward recursion.
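Concretely, the backward recursion (from the GAE formulation of Schulman et al.) computes a per-step TD residual and accumulates it with exponentially decaying weight, zeroing the bootstrap across episode boundaries via the done flag d_t:

```
\delta_t = r_t + \gamma\,(1 - d_t)\,V(s_{t+1}) - V(s_t)
A_t = \delta_t + \gamma\lambda\,(1 - d_t)\,A_{t+1}, \qquad A_T = 0
```

With lambda = 0 this reduces to the one-step TD advantage; with lambda = 1 it approaches the Monte Carlo advantage estimate.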

Usage

Implement this pattern after collecting a rollout of T steps from N vectorized environments. Requires a value function (typically a neural network) to estimate state values for bootstrapping.

Code Reference

Source Location

  • Repository: User-implemented pattern (not in Gymnasium source)
  • Reference: Gymnasium tutorials use this pattern with vectorized environments

Signature

def compute_gae(
    rewards: np.ndarray,       # (T, N) rewards from envs.step()
    values: np.ndarray,        # (T+1, N) value estimates from critic
    dones: np.ndarray,         # (T, N) episode done flags
    gamma: float = 0.99,       # Discount factor
    gae_lambda: float = 0.95,  # GAE lambda parameter
) -> np.ndarray:
    """Compute GAE advantages from collected rollout data.

    Args:
        rewards: Per-step rewards, shape (T, N).
        values: Value estimates, shape (T+1, N) including bootstrap.
        dones: Done flags, shape (T, N).
        gamma: Discount factor.
        gae_lambda: GAE lambda for bias-variance tradeoff.

    Returns:
        advantages: GAE advantages, shape (T, N).
    """

Import

# User-defined function; only NumPy is required
import numpy as np

I/O Contract

Inputs

Name Type Required Description
rewards np.ndarray (T, N) Yes Rewards collected from VectorEnv.step()
values np.ndarray (T+1, N) Yes Value estimates from critic network
dones np.ndarray (T, N) Yes Episode completion flags
gamma float No Discount factor (default 0.99)
gae_lambda float No GAE lambda (default 0.95)

Outputs

Name Type Description
advantages np.ndarray (T, N) GAE advantage estimates per step per env
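Downstream, most policy gradient implementations also need return targets for the critic's value loss. These are not part of this page's signature, but a common companion pattern (a minimal sketch, assuming the same (T, N) / (T+1, N) shapes as above) recovers them directly from the advantages:

```python
import numpy as np

def compute_returns(advantages: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Recover lambda-return targets for the critic: R_t = A_t + V(s_t).

    advantages: (T, N) output of compute_gae.
    values: (T+1, N) value estimates including the bootstrap row.
    """
    # Drop the bootstrap row so shapes align with the advantages.
    return advantages + values[:-1]
```

The value loss then regresses the critic toward these targets, e.g. `((critic(obs) - returns) ** 2).mean()`.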

Usage Examples

GAE Implementation

import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    T, N = rewards.shape
    advantages = np.zeros((T, N))
    last_gae = np.zeros(N)

    for t in reversed(range(T)):
        next_non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * next_non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[t] = last_gae

    return advantages

# Usage with vectorized environments
import gymnasium as gym

envs = gym.make_vec("CartPole-v1", num_envs=8)
obs, _ = envs.reset(seed=42)

T = 128  # rollout length
all_rewards = np.zeros((T, envs.num_envs))
all_dones = np.zeros((T, envs.num_envs))
all_values = np.zeros((T + 1, envs.num_envs))

for t in range(T):
    # all_values[t] = critic(obs)  # Value estimate
    actions = envs.action_space.sample()
    obs, rewards, terms, truncs, infos = envs.step(actions)
    all_rewards[t] = rewards
    # Simplification: truncation is treated as termination here, which skips
    # bootstrapping from the final observation and slightly biases the estimate
    all_dones[t] = np.logical_or(terms, truncs)

# all_values[T] = critic(obs)  # Bootstrap value
advantages = compute_gae(all_rewards, all_values, all_dones)
envs.close()
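A tiny hand-checkable case helps verify an implementation of this pattern. The numbers below are chosen for easy mental arithmetic (gamma = 0.5, lambda = 1.0, no terminations); the function is repeated so the check runs standalone:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    # Same backward recursion as the implementation above.
    T, N = rewards.shape
    advantages = np.zeros((T, N))
    last_gae = np.zeros(N)
    for t in reversed(range(T)):
        next_non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * next_non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[t] = last_gae
    return advantages

# T=2 steps, N=1 env, no episode boundaries.
rewards = np.array([[1.0], [1.0]])
values = np.array([[0.0], [0.0], [4.0]])  # V(s_2) = 4 is the bootstrap value
dones = np.zeros((2, 1))

adv = compute_gae(rewards, values, dones, gamma=0.5, gae_lambda=1.0)
# Backward pass by hand:
#   t=1: delta = 1 + 0.5*4 - 0 = 3.0        -> A_1 = 3.0
#   t=0: delta = 1 + 0.5*0 - 0 = 1.0        -> A_0 = 1.0 + 0.5*3.0 = 2.5
```

In practice, advantages are often also normalized per batch (`(adv - adv.mean()) / (adv.std() + 1e-8)`) before the policy update, though that is a separate heuristic.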
