Workflow:Farama Foundation Gymnasium Policy Gradient Training

Knowledge Sources	Gymnasium, Gymnasium Docs, MuJoCo Environments, REINFORCE Algorithm
Domains	Reinforcement_Learning, Deep_RL, Continuous_Control
Last Updated	2026-02-15 03:00 GMT

Overview

End-to-end process for training a neural network policy using the REINFORCE policy gradient algorithm on a continuous-control MuJoCo environment.

Description

This workflow implements the REINFORCE algorithm (also known as Monte Carlo Policy Gradient) to train a parametric policy network for continuous action spaces. Unlike value-based methods, which learn a value function (a Q-table in the tabular case) and derive a policy from it, REINFORCE optimizes the policy directly by gradient ascent on the expected return, estimated from the Monte Carlo returns of complete episodes. The policy network outputs the mean and standard deviation of a Normal distribution from which continuous actions are sampled. The workflow covers defining the policy network architecture in PyTorch, collecting full episode trajectories, computing discounted returns, calculating the policy gradient loss, and evaluating across multiple random seeds for statistical robustness. The primary example trains on the InvertedPendulum-v4 MuJoCo environment.

Usage

Use this workflow when you need to train an RL agent with continuous action spaces (such as MuJoCo physics environments) using a policy gradient method. It suits continuous control problems where tabular methods are infeasible because the state and action spaces are continuous and cannot be enumerated. It also serves as a starting point for understanding policy gradient methods before moving to more advanced algorithms like PPO or SAC.

Execution Steps

Step 1: Environment and Experiment Setup

Initialize the MuJoCo environment using gymnasium.make and wrap it with RecordEpisodeStatistics for performance tracking. Extract the observation space dimensions and action space dimensions from the environment. Define experiment parameters including total number of episodes, learning rate, discount factor (gamma), and the set of random seeds to test for statistical robustness.

Key considerations:

MuJoCo environments have continuous observation and action spaces (Box type)
Multiple random seeds (e.g., Fibonacci: 1, 2, 3, 5, 8) are essential for reliable evaluation
Deep RL is sensitive to random seeds; testing multiple seeds reveals true algorithm performance
Set torch.manual_seed, random.seed, and np.random.seed for reproducibility
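
The setup described above can be sketched as follows. This is a minimal sketch, not the workflow's actual code: ExperimentConfig, seed_everything, and make_env are hypothetical names, and the default episode count, learning rate, and discount factor are assumed values that the text does not specify. Creating the environment assumes Gymnasium is installed with MuJoCo support.

```python
import random
from dataclasses import dataclass

import numpy as np
import torch


@dataclass(frozen=True)
class ExperimentConfig:
    """Experiment parameters; the numeric defaults are assumptions."""
    env_id: str = "InvertedPendulum-v4"
    total_episodes: int = 5000      # assumed value
    learning_rate: float = 1e-4     # assumed value
    gamma: float = 0.99             # assumed discount factor
    seeds: tuple = (1, 2, 3, 5, 8)  # Fibonacci seeds from the text


def seed_everything(seed: int) -> None:
    """Seed the three random number generators named in the text."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def make_env(env_id: str):
    """Create the wrapped environment and read its space dimensions."""
    # Imported here so the helpers above work without MuJoCo installed.
    import gymnasium as gym

    env = gym.make(env_id)
    env = gym.wrappers.RecordEpisodeStatistics(env)  # tracks episode returns
    obs_dim = env.observation_space.shape[0]  # Box observation space
    act_dim = env.action_space.shape[0]       # Box action space
    return env, obs_dim, act_dim
```

A training script would call seed_everything(seed) once per seed and make_env(config.env_id) before building the policy.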

Step 2: Policy Network Definition

Define a neural network that takes environment observations as input and outputs parameters of a probability distribution over actions. The network architecture consists of shared hidden layers (with Tanh activations) feeding into two separate output heads: one for the action mean and one for the action standard deviation. The standard deviation head uses a softplus transformation (log(1 + exp(x))) to ensure positive values.

Key considerations:

The network outputs mean and std of a Normal distribution for continuous actions
Shared layers extract common features; separate heads specialize for mean and std
Softplus ensures the standard deviation is always positive
Small network sizes (16-32 hidden units) are sufficient for simple control tasks
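
A network of this shape might look like the sketch below. The class name PolicyNetwork and the specific hidden sizes (16 and 32, picked from the range mentioned above) are assumptions for illustration.

```python
import torch
from torch import nn


class PolicyNetwork(nn.Module):
    """Maps observations to the mean and std of a Normal action distribution."""

    def __init__(self, obs_dim: int, act_dim: int,
                 hidden1: int = 16, hidden2: int = 32):
        super().__init__()
        # Shared feature extractor with Tanh activations.
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, hidden1),
            nn.Tanh(),
            nn.Linear(hidden1, hidden2),
            nn.Tanh(),
        )
        # Separate linear heads for the two distribution parameters.
        self.mean_head = nn.Linear(hidden2, act_dim)
        self.std_head = nn.Linear(hidden2, act_dim)

    def forward(self, obs: torch.Tensor):
        features = self.shared(obs)
        mean = self.mean_head(features)
        # softplus(x) = log(1 + exp(x)) keeps the std positive.
        std = nn.functional.softplus(self.std_head(features))
        return mean, std
```

For a batch of observations with shape (batch, obs_dim), both outputs have shape (batch, act_dim).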

Step 3: Episode Trajectory Collection

For each training episode, reset the environment and collect a complete trajectory by repeatedly sampling actions from the policy distribution, executing them in the environment, and storing the log-probabilities of the sampled actions along with the received rewards. Continue until the episode terminates or is truncated. This produces one complete episode of experience per training iteration.

Key considerations:

REINFORCE requires complete episodes (Monte Carlo returns) before updating
Store log-probabilities (not raw probabilities) for numerical stability
Actions are sampled from Normal(mean, std) and then converted to numpy for env.step
The episode ends when it is terminated (a terminal state is reached, e.g., the pendulum falls over) or truncated (the time limit is reached)
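
The collection loop can be sketched as below. The function name collect_episode is hypothetical, the policy is assumed to be any callable that returns a (mean, std) pair for a batch of observations, and summing the log-probability over action dimensions is an assumption that treats the dimensions as independent. The env argument only needs the standard Gymnasium reset and step methods.

```python
import torch
from torch.distributions import Normal


def collect_episode(env, policy, seed=None, eps=1e-6):
    """Run one full episode; return per-step log-probs and rewards."""
    log_probs, rewards = [], []
    obs, _info = env.reset(seed=seed)
    done = False
    while not done:
        # Add a batch dimension before calling the policy.
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        mean, std = policy(obs_t)
        # eps keeps the distribution parameters away from exact zero.
        dist = Normal(mean[0] + eps, std[0] + eps)
        action = dist.sample()
        # Keep the log-probability (with its graph) for the later update.
        log_probs.append(dist.log_prob(action).sum())
        obs, reward, terminated, truncated, _info = env.step(
            action.detach().cpu().numpy()
        )
        rewards.append(float(reward))
        done = terminated or truncated
    return log_probs, rewards
```

The returned log-probabilities still carry gradient information, so they can be combined with the discounted returns in the update step.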

Step 4: Policy Gradient Update

Compute discounted returns by iterating backwards through the episode rewards, applying the discount factor gamma at each step. Calculate the REINFORCE loss as the negative sum of log-probabilities weighted by their corresponding discounted returns. Perform backpropagation and update the policy network parameters using the optimizer (typically AdamW).

Pseudocode:

Compute discounted returns G_t = sum_{k=0}^{T-t-1} gamma^k * r_{t+k} for each timestep t, where T is the episode length
Calculate loss = -sum_t(log_prob_t * G_t)
Zero gradients, backpropagate loss, step optimizer
Clear stored probabilities and rewards for next episode
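
As a small worked example of the first pseudocode step, the backward running sum can be written in plain Python. The function name discounted_returns is hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every timestep using a backward running sum."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        # G_t = r_t + gamma * G_{t+1}, with G after the last step equal to 0.
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns


# Three steps of reward 1 with gamma = 0.5:
# G_2 = 1, G_1 = 1 + 0.5 * 1 = 1.5, G_0 = 1 + 0.5 * 1.5 = 1.75
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Each return is computed in constant time from the one after it, so the whole episode costs one pass over the rewards.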

Key considerations:

Returns are computed backwards for efficiency (running sum with discount)
The negative sign converts maximization of expected return to minimization of loss
Clear episode buffers after each update to prevent memory accumulation
A small epsilon (1e-6) is added to the distribution parameters (mean and standard deviation) when the Normal distribution is constructed during action sampling, for numerical stability
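
Putting the update together, a minimal sketch might look like this. The function name reinforce_update is hypothetical; it assumes log_probs is a list of scalar tensors that still carry gradients and rewards is a list of floats from the same episode.

```python
import torch


def reinforce_update(optimizer, log_probs, rewards, gamma):
    """Apply one REINFORCE update from a single episode's buffers."""
    # Discounted returns via a backward running sum.
    running, returns = 0.0, []
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.insert(0, running)
    returns_t = torch.tensor(returns, dtype=torch.float32)

    # Negative sign: minimizing this loss maximizes the expected return.
    loss = -(torch.stack(log_probs) * returns_t).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Clear the episode buffers so they do not grow across episodes.
    log_probs.clear()
    rewards.clear()
    return loss.item()
```

The optimizer could be created as torch.optim.AdamW(policy.parameters(), lr=learning_rate), matching the optimizer named above.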

Step 5: Multi-Seed Evaluation and Visualization

Repeat the entire training process across multiple random seeds, collecting the reward history for each seed. Aggregate the results and plot learning curves showing the mean and variance of episode rewards across seeds. Use seaborn or matplotlib to create publication-quality visualizations that demonstrate the algorithm's learning progress and stability.

Key considerations:

Reinitialize the agent for each seed to get independent training runs
Use pandas DataFrames and seaborn for clean multi-seed visualization
Log periodic progress (e.g., every 1000 episodes) during training
Expected behavior: rewards increase over training and stabilize near the environment maximum
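
The aggregation and plotting can be sketched as below. The function names and the column names (seed, episode, reward) are hypothetical, and rewards_by_seed is assumed to be a dict mapping each seed to its list of episode rewards. Note that seaborn's lineplot, by default, draws the mean across seeds with a confidence band rather than the raw variance.

```python
import pandas as pd


def rewards_to_long_frame(rewards_by_seed):
    """Flatten {seed: [episode rewards]} into one row per (seed, episode)."""
    rows = [
        {"seed": seed, "episode": episode, "reward": reward}
        for seed, rewards in rewards_by_seed.items()
        for episode, reward in enumerate(rewards)
    ]
    return pd.DataFrame(rows)


def plot_learning_curve(frame):
    """Plot the mean reward per episode across seeds."""
    # Imported here so the aggregation helper works without seaborn.
    import seaborn as sns

    # Repeated x values (one per seed) are aggregated into a mean line.
    ax = sns.lineplot(data=frame, x="episode", y="reward")
    ax.set(title="REINFORCE learning curve across seeds")
    return ax
```

A call such as plot_learning_curve(rewards_to_long_frame(rewards_by_seed)) after all seeds have finished produces a single averaged curve.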
