Implementation:Farama Foundation Gymnasium REINFORCE Update
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Policy_Gradient |
| Last Updated | 2026-02-15 03:00 GMT |
Overview
User-defined computation pattern for the REINFORCE policy gradient update using Gymnasium's environment interface.
Description
The REINFORCE update is a Pattern Doc: a standard training pattern implemented by users on top of Gymnasium environments. It collects episode data via env.step(), computes discounted returns, and updates policy parameters using the policy gradient theorem.
Usage
Implement this pattern to train a policy network on a Gymnasium environment. The environment provides observations, rewards, and terminated/truncated flags; the user provides the policy network, optimizer, and update logic.
Code Reference
Source Location
- Repository: User-implemented pattern (based on Gymnasium tutorial)
- Reference: REINFORCE Tutorial
Signature
```python
def reinforce_update(
    policy: nn.Module,  # Parameterized policy network
    optimizer: torch.optim.Optimizer,
    rewards: list[float],  # Episode rewards from env.step()
    log_probs: list[torch.Tensor],  # Log probabilities of taken actions
    gamma: float = 0.99,  # Discount factor
) -> float:
    """Perform a REINFORCE policy gradient update.

    Args:
        policy: The policy network.
        optimizer: Optimizer for policy parameters.
        rewards: Per-step rewards from the episode.
        log_probs: Log probabilities of actions taken.
        gamma: Discount factor.

    Returns:
        loss: The policy gradient loss value.
    """
```
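The signature leaves the body to the user. A minimal sketch of one common implementation follows; the return normalization is a widely used variance-reduction step, not part of the contract, and the exact loss formulation here is one reasonable choice rather than the only one:

```python
import torch
import torch.nn as nn


def reinforce_update(
    policy: nn.Module,
    optimizer: torch.optim.Optimizer,
    rewards: list[float],
    log_probs: list[torch.Tensor],
    gamma: float = 0.99,
) -> float:
    """Perform a REINFORCE policy gradient update (sketch)."""
    # Discounted returns, accumulated backwards through the episode:
    # G_t = r_t + gamma * G_{t+1}
    returns: list[float] = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns_t = torch.tensor(returns)

    # Optional: normalize returns to reduce gradient variance.
    returns_t = (returns_t - returns_t.mean()) / (returns_t.std() + 1e-8)

    # Policy gradient loss: -sum_t log pi(a_t|s_t) * G_t
    loss = -torch.sum(torch.stack(log_probs) * returns_t)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because log_probs must carry gradients back into the policy, they should be collected during action sampling (as in the usage example below) rather than recomputed from detached tensors.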
Import
```python
# User-defined pattern
import torch
import torch.nn as nn
import gymnasium as gym
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| rewards | list[float] | Yes | Per-step rewards from env.step() |
| log_probs | list[Tensor] | Yes | Log probs from policy forward pass |
| gamma | float | No | Discount factor (default 0.99) |
Outputs
| Name | Type | Description |
|---|---|---|
| loss | float | Policy gradient loss for logging |
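The discounted-return computation implied by this contract can be checked by hand. For a hypothetical three-step episode with rewards [1, 1, 1] and the default gamma of 0.99 (values chosen purely for illustration):

```python
# Hand-checkable discounted returns for a hypothetical three-step
# episode; rewards are illustrative, not from a real environment.
gamma = 0.99
rewards = [1.0, 1.0, 1.0]

returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G  # G_t = r_t + gamma * G_{t+1}
    returns.insert(0, G)

print(returns)  # approximately [2.9701, 1.99, 1.0]
```

Each entry is the reward-to-go from that step, so earlier steps accumulate larger returns.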
Usage Examples
REINFORCE with Gymnasium
```python
import torch
import torch.nn as nn
import gymnasium as gym


class PolicyNetwork(nn.Module):
    def __init__(self, obs_size, action_size, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_size),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)


# Training loop
env = gym.make("CartPole-v1")
policy = PolicyNetwork(4, 2)  # CartPole: 4 observation dims, 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(500):
    obs, info = env.reset()
    log_probs, rewards = [], []
    terminated, truncated = False, False
    while not (terminated or truncated):
        obs_tensor = torch.FloatTensor(obs)
        probs = policy(obs_tensor)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, info = env.step(action.item())
        rewards.append(reward)

    # Compute discounted returns
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + 0.99 * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Policy gradient update
    loss = -sum(lp * g for lp, g in zip(log_probs, returns))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

env.close()
```
Related Pages
- Implements Principle
- Requires Environment