Workflow:Farama Foundation Gymnasium RL Agent Training Loop

Knowledge Sources	Gymnasium Gymnasium Docs Basic Usage Train Agent
Domains	Reinforcement_Learning, Agent_Training, Tabular_RL
Last Updated	2026-02-15 03:00 GMT

Overview

End-to-end process for training a tabular reinforcement learning agent on a Gymnasium environment using the Q-learning algorithm.

Description

This workflow covers the standard procedure for training an RL agent from scratch using Gymnasium's environment API. It demonstrates the fundamental agent-environment interaction loop where the agent observes the environment state, selects an action (using an epsilon-greedy exploration strategy), receives a reward, and updates its Q-value estimates via the Bellman equation. The workflow covers environment initialization with gymnasium.make, the core training loop with env.step and env.reset, Q-table management using defaultdict, epsilon-greedy action selection, temporal difference learning updates, and training progress visualization. The primary example uses the Blackjack-v1 environment with tabular Q-learning, but the pattern applies to any discrete-state Gymnasium environment.

Usage

Execute this workflow when you have a discrete-state Gymnasium environment and want to train a tabular Q-learning agent. This is the foundational RL training pattern suitable for environments with manageable state spaces (such as Blackjack, FrozenLake, Taxi, or CliffWalking). Use this workflow when you need a baseline agent, want to understand Q-learning fundamentals, or are prototyping before moving to function approximation methods.

Execution Steps

Step 1: Environment Setup

Initialize the Gymnasium environment using the make factory function with the desired environment ID. Wrap the environment with RecordEpisodeStatistics to automatically track episode rewards, lengths, and timing data. Configure training hyperparameters including learning rate, number of episodes, epsilon schedule (start value, decay rate, and minimum floor), and discount factor.

Key considerations:

Choose an environment with a discrete state and action space for tabular methods
The RecordEpisodeStatistics wrapper must be applied before the training loop begins
Set buffer_length to the total number of training episodes; otherwise the wrapper's statistics queues retain only the most recent episodes

Step 2: Agent Initialization

Create the Q-learning agent with a Q-table (a dictionary mapping each state to a vector of per-action value estimates), learning rate, discount factor, and epsilon-greedy exploration parameters. The Q-table is typically a defaultdict that returns a zero vector for unseen states, so entries are created lazily as new states are encountered during training.

Key considerations:

Use defaultdict with a lambda returning np.zeros(action_space.n) for automatic Q-table expansion
The initial epsilon should be high (typically 1.0) for maximum exploration at the start
Discount factor (gamma) controls how much the agent values future rewards versus immediate rewards
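
A minimal sketch of such an agent is shown here. The class name, constructor arguments, and default values are hypothetical choices for illustration; the agent takes the number of actions as a plain integer (for a Gymnasium environment this would be env.action_space.n) and uses its own NumPy random generator for exploration.

```python
from collections import defaultdict

import numpy as np


class QLearningAgent:
    """Hypothetical tabular agent: lazy Q-table plus epsilon-greedy selection."""

    def __init__(self, n_actions, learning_rate=0.01, discount_factor=0.95,
                 initial_epsilon=1.0, epsilon_decay=2e-5, final_epsilon=0.1,
                 seed=None):
        self.n_actions = n_actions
        # Unseen states get a zero vector with one entry per action.
        self.q_values = defaultdict(lambda: np.zeros(n_actions))
        self.lr = learning_rate
        self.discount_factor = discount_factor
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon
        self.rng = np.random.default_rng(seed)

    def get_action(self, obs):
        # Explore with probability epsilon, otherwise pick the best known action.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.n_actions))
        return int(np.argmax(self.q_values[obs]))

    def decay_epsilon(self):
        # Linear decay, clipped at the configured floor.
        self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)


agent = QLearningAgent(n_actions=2, seed=0)
```

Observations must be hashable (tuples or integers) to serve as dictionary keys, which holds for environments such as Blackjack and FrozenLake.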

Step 3: Training Loop Execution

For each episode: reset the environment to get the initial observation, then loop until the episode terminates or is truncated. At each timestep, select an action using the epsilon-greedy strategy (random action with probability epsilon, best known action otherwise), execute the action via env.step, observe the reward and next state, and update the Q-value using the temporal difference update rule. After each episode, decay epsilon to gradually shift from exploration to exploitation.

Key considerations:

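The Q-value update is the temporal difference form of the Bellman optimality equation: Q(s,a) += lr * (reward + gamma * max_a' Q(s',a') - Q(s,a))
When the episode terminates, the future Q-value is zero (no future rewards possible); when it is only truncated, the update still bootstraps from the next state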
Epsilon decay should be calibrated so the agent explores sufficiently before converging
Track the temporal difference error for monitoring learning progress
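
The loop and update rule above can be sketched as follows. To stay runnable without extra dependencies, the sketch uses CorridorEnv, a hypothetical stand-in that only mimics the reset and step signatures of a Gymnasium environment; the hyperparameter values are likewise assumptions chosen for this toy problem.

```python
from collections import defaultdict

import numpy as np


class CorridorEnv:
    """Hypothetical 5-cell corridor mimicking Gymnasium's reset/step signatures."""

    def __init__(self, length=5, max_steps=100):
        self.length, self.max_steps = length, max_steps

    def reset(self, seed=None):
        self.pos, self.t = 0, 0
        return self.pos, {}

    def step(self, action):  # 0 = left, 1 = right
        self.t += 1
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        terminated = self.pos == self.length - 1  # reached the goal cell
        truncated = not terminated and self.t >= self.max_steps  # time limit
        return self.pos, float(terminated), terminated, truncated, {}


env = CorridorEnv()
rng = np.random.default_rng(0)
q_values = defaultdict(lambda: np.zeros(2))
lr, gamma, n_episodes = 0.1, 0.95, 500
epsilon, final_epsilon = 1.0, 0.1
epsilon_decay = epsilon / (n_episodes / 2)
td_errors = []

for episode in range(n_episodes):
    obs, info = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: random action with probability epsilon, else greedy.
        if rng.random() < epsilon:
            action = int(rng.integers(2))
        else:
            action = int(np.argmax(q_values[obs]))

        next_obs, reward, terminated, truncated, info = env.step(action)

        # No future value after termination; still bootstrap after truncation.
        future_q = 0.0 if terminated else float(np.max(q_values[next_obs]))
        td_error = reward + gamma * future_q - q_values[obs][action]
        q_values[obs][action] += lr * td_error
        td_errors.append(float(td_error))

        obs = next_obs
        done = terminated or truncated

    # Decay once per episode, never below the floor.
    epsilon = max(final_epsilon, epsilon - epsilon_decay)

greedy_policy = [int(np.argmax(q_values[s])) for s in range(4)]
```

With a real Gymnasium environment, only the env construction changes; the loop body relies solely on the five values returned by step.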

Step 4: Training Progress Visualization

After training completes, plot learning curves from the statistics collected by the RecordEpisodeStatistics wrapper, together with the temporal difference errors recorded during training. Generate moving average plots for episode rewards, episode lengths, and temporal difference errors. Use a rolling window (e.g., 500 episodes) to smooth the noisy per-episode data and reveal overall trends.

Key considerations:

Episode rewards should show gradual improvement over training
Decreasing temporal difference errors indicate the Q-values are stabilizing
Episode length changes may indicate the agent is learning different strategies
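
The smoothing step can be sketched with a plain NumPy convolution. The moving_average helper is a hypothetical name, and the episode returns below are synthetic stand-in data; in a real run the wrapper's return_queue and length_queue, or the recorded temporal difference errors, would be passed in instead.

```python
import numpy as np


def moving_average(values, window):
    """Mean of each full run of `window` consecutive values."""
    values = np.asarray(values, dtype=float).flatten()
    return np.convolve(values, np.ones(window), mode="valid") / window


# Synthetic stand-in for per-episode returns (assumed data, not real results).
rng = np.random.default_rng(0)
episode_returns = rng.choice([-1.0, 0.0, 1.0], size=5_000)

# A 500-episode window yields len(values) - window + 1 smoothed points.
smoothed = moving_average(episode_returns, window=500)
```

The smoothed array can then be handed to any plotting library; mode="valid" drops the partial windows at the edges, so the curve is shorter than the raw series.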

Step 5: Agent Evaluation

Test the trained agent by running evaluation episodes with exploration disabled (epsilon set to zero). Collect statistics on win rate, average reward, and consistency (standard deviation) to assess the quality of the learned policy. Compare against known optimal performance for the environment if available.

Key considerations:

Disable exploration during evaluation to test the pure learned policy
Run sufficient evaluation episodes (e.g., 1000+) for statistically meaningful results
Restore the original epsilon after evaluation if further training is planned
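
One way to sketch the evaluation procedure is a context manager that zeroes epsilon and restores it afterwards. The names exploration_disabled and evaluate are hypothetical, as are the OneStepEnv and ToyAgent stubs, which exist only to make the example runnable; any agent with an epsilon attribute and a get_action method, and any environment following Gymnasium's reset and step signatures, would fit.

```python
from contextlib import contextmanager

import numpy as np


@contextmanager
def exploration_disabled(agent):
    """Temporarily set epsilon to zero, restoring the old value on exit."""
    old_epsilon = agent.epsilon
    agent.epsilon = 0.0
    try:
        yield agent
    finally:
        agent.epsilon = old_epsilon


def evaluate(env, agent, n_episodes=1000):
    """Run greedy episodes and summarize the total reward per episode."""
    totals = []
    with exploration_disabled(agent):
        for _ in range(n_episodes):
            obs, info = env.reset()
            done, total = False, 0.0
            while not done:
                action = agent.get_action(obs)
                obs, reward, terminated, truncated, info = env.step(action)
                total += reward
                done = terminated or truncated
            totals.append(total)
    totals = np.asarray(totals)
    return {
        "win_rate": float(np.mean(totals > 0)),
        "mean_reward": float(np.mean(totals)),
        "std_reward": float(np.std(totals)),
    }


class OneStepEnv:
    """Hypothetical stub: one-step episodes where action 1 wins, action 0 loses."""

    def reset(self, seed=None):
        return 0, {}

    def step(self, action):
        return 0, (1.0 if action == 1 else -1.0), True, False, {}


class ToyAgent:
    """Hypothetical stub whose greedy choice is always action 1."""

    def __init__(self):
        self.epsilon = 0.3
        self.rng = np.random.default_rng(0)

    def get_action(self, obs):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(2))
        return 1


agent = ToyAgent()
stats = evaluate(OneStepEnv(), agent, n_episodes=1000)
```

Counting a positive total reward as a win matches Blackjack-style outcomes; other environments may need a different success criterion.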
