
Implementation:Farama Foundation Gymnasium REINFORCE Update

From Leeroopedia
Knowledge Sources
Domains Reinforcement_Learning, Policy_Gradient
Last Updated 2026-02-15 03:00 GMT

Overview

User-defined computation pattern for the REINFORCE policy gradient update using Gymnasium's environment interface.

Description

The REINFORCE update is a Pattern Doc — a standard training pattern implemented by users on top of Gymnasium environments. It collects episode data via env.step(), computes discounted returns, and updates policy parameters using the policy gradient theorem.
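The discounted-returns step mentioned above can be sketched as a single backward pass over the episode's rewards (a minimal illustrative helper; the name `discounted_returns` is not part of the pattern itself):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} in one backward pass."""
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G     # fold the future return into the current step
        returns.insert(0, G)  # prepend so returns[t] aligns with rewards[t]
    return returns
```

Iterating in reverse makes each return a constant-time update instead of an O(T) inner sum per timestep.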

Usage

Implement this pattern to train a policy network on a Gymnasium environment. The environment provides observations, rewards, and termination/truncation flags; the user provides the policy network, optimizer, and update logic.

Code Reference

Source Location

  • Repository: User-implemented pattern (based on Gymnasium tutorial)
  • Reference: REINFORCE Tutorial

Signature

def reinforce_update(
    policy: nn.Module,         # Parameterized policy network
    optimizer: torch.optim.Optimizer,
    rewards: list[float],      # Episode rewards from env.step()
    log_probs: list[torch.Tensor],  # Log probabilities of taken actions
    gamma: float = 0.99,       # Discount factor
) -> float:
    """Perform a REINFORCE policy gradient update.

    Args:
        policy: The policy network.
        optimizer: Optimizer for policy parameters.
        rewards: Per-step rewards from the episode.
        log_probs: Log probabilities of actions taken.
        gamma: Discount factor.

    Returns:
        loss: The policy gradient loss value.
    """
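Since the page leaves the body to the user, one possible implementation of this signature is sketched below (an assumption-laden sketch, not a canonical version; it mirrors the return normalization used in the tutorial example further down):

```python
import torch
import torch.nn as nn


def reinforce_update(
    policy: nn.Module,
    optimizer: torch.optim.Optimizer,
    rewards: list,
    log_probs: list,
    gamma: float = 0.99,
) -> float:
    # Discounted returns G_t = r_t + gamma * G_{t+1}, computed back to front
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # REINFORCE loss: -sum_t log pi(a_t | s_t) * G_t
    loss = -torch.sum(torch.stack(log_probs) * returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The `policy` argument is not called directly here: the gradient flows to its parameters through the `log_probs` tensors, which must have been produced by the policy's forward pass during the episode.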

Import

# User-defined pattern
import torch
import torch.nn as nn
import gymnasium as gym

I/O Contract

Inputs

Name Type Required Description
rewards list[float] Yes Per-step rewards from env.step()
log_probs list[Tensor] Yes Log probs from policy forward pass
gamma float No Discount factor (default 0.99)

Outputs

Name Type Description
loss float Policy gradient loss for logging

Usage Examples

REINFORCE with Gymnasium

import torch
import torch.nn as nn
import numpy as np
import gymnasium as gym

class PolicyNetwork(nn.Module):
    def __init__(self, obs_size, action_size, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_size),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)

# Training loop
env = gym.make("CartPole-v1")
obs_size = env.observation_space.shape[0]  # 4 for CartPole-v1
n_actions = env.action_space.n             # 2 for CartPole-v1
policy = PolicyNetwork(obs_size, n_actions)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(500):
    obs, info = env.reset()
    log_probs, rewards = [], []

    terminated, truncated = False, False
    while not (terminated or truncated):
        obs_tensor = torch.as_tensor(obs, dtype=torch.float32)
        probs = policy(obs_tensor)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))

        obs, reward, terminated, truncated, info = env.step(action.item())
        rewards.append(reward)

    # Compute discounted returns G_t = r_t + gamma * G_{t+1}, back to front
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + 0.99 * G  # gamma = 0.99
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalize returns to reduce gradient variance
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Policy gradient update: loss = -sum_t log pi(a_t|s_t) * G_t
    loss = -sum(lp * G_t for lp, G_t in zip(log_probs, returns))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

env.close()

Related Pages

Implements Principle

Requires Environment
