Workflow:Farama Foundation Gymnasium RL Agent Training Loop
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Agent_Training, Tabular_RL |
| Last Updated | 2026-02-15 03:00 GMT |
Overview
End-to-end process for training a tabular reinforcement learning agent on a Gymnasium environment using the Q-learning algorithm.
Description
This workflow covers the standard procedure for training an RL agent from scratch using Gymnasium's environment API. It demonstrates the fundamental agent-environment interaction loop: the agent observes the environment state, selects an action with an epsilon-greedy exploration strategy, receives a reward and the next state, and updates its Q-value estimates with a temporal difference update derived from the Bellman optimality equation. The workflow covers environment initialization with gymnasium.make, the core training loop with env.reset and env.step, Q-table management using defaultdict, epsilon-greedy action selection, temporal difference learning updates, and training progress visualization. The primary example uses the Blackjack-v1 environment with tabular Q-learning, but the pattern applies to any Gymnasium environment with discrete, hashable states and a discrete action space.
Usage
Execute this workflow when you have a discrete-state Gymnasium environment and want to train a tabular Q-learning agent. This is the foundational RL training pattern suitable for environments with manageable state spaces (such as Blackjack, FrozenLake, Taxi, or CliffWalking). Use this workflow when you need a baseline agent, want to understand Q-learning fundamentals, or are prototyping before moving to function approximation methods.
Execution Steps
Step 1: Environment Setup
Initialize the Gymnasium environment using the make factory function with the desired environment ID. Wrap the environment with RecordEpisodeStatistics to automatically track episode rewards, lengths, and timing data. Configure training hyperparameters including learning rate, number of episodes, epsilon schedule (start value, decay rate, and minimum floor), and discount factor.
Key considerations:
- Choose an environment with a discrete state and action space for tabular methods
- The RecordEpisodeStatistics wrapper must be applied before the training loop begins
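- Set the wrapper's buffer_length to at least the total number of training episodes; the wrapper keeps only the most recent buffer_length episodes, so a smaller buffer discards early statistics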
Step 2: Agent Initialization
Create the Q-learning agent with a Q-table (a dictionary mapping each state to a vector of per-action value estimates), learning rate, discount factor, and epsilon-greedy exploration parameters. The Q-table is typically initialized as a defaultdict that returns a zero vector for any unseen state, so entries are created lazily as new states are encountered during training.
Key considerations:
- Use defaultdict with a lambda returning np.zeros(action_space.n) for automatic Q-table expansion
- The initial epsilon should be high (typically 1.0) for maximum exploration at the start
- Discount factor (gamma) controls how much the agent values future rewards versus immediate rewards
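A minimal sketch of such an agent follows. The class name, attribute names, and default values are hypothetical. To keep the example dependency-free, plain Python lists stand in for np.zeros(action_space.n) and random.Random stands in for sampling from the environment's action space.

```python
import random
from collections import defaultdict


class QLearningAgent:
    """Tabular agent state: a lazily filled Q-table plus exploration settings."""

    def __init__(self, n_actions, learning_rate=0.01, discount_factor=0.95,
                 initial_epsilon=1.0, seed=None):
        self.n_actions = n_actions
        self.lr = learning_rate
        self.discount_factor = discount_factor
        self.epsilon = initial_epsilon
        self.rng = random.Random(seed)
        # Unseen states receive a fresh zero vector on first access.
        self.q_values = defaultdict(lambda: [0.0] * n_actions)

    def get_action(self, obs):
        """Random action with probability epsilon, greedy action otherwise."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)
        row = self.q_values[obs]
        # Ties resolve to the lowest action index.
        return max(range(self.n_actions), key=row.__getitem__)
```

The observation is used as a dictionary key, so it must be hashable (an int or a tuple of ints, as in Blackjack or FrozenLake).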
Step 3: Training Loop Execution
For each episode: reset the environment to get the initial observation, then loop until the episode terminates or is truncated. At each timestep, select an action using the epsilon-greedy strategy (random action with probability epsilon, best known action otherwise), execute the action via env.step, observe the reward and next state, and update the Q-value using the temporal difference update rule. After each episode, decay epsilon to gradually shift from exploration to exploitation.
Key considerations:
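- The Q-value update is the Q-learning rule derived from the Bellman optimality equation: Q(s,a) += lr * (reward + gamma * max_a' Q(s',a') - Q(s,a))
- When the episode terminates (terminated is true), the future Q-value term is zero because no future rewards are possible; truncation by a time limit alone does not imply this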
- Epsilon decay should be calibrated so the agent explores sufficiently before converging
- Track the temporal difference error for monitoring learning progress
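The loop and update rule above can be sketched as follows. This is one possible arrangement under stated assumptions: the function names and signatures are hypothetical, env is any object following Gymnasium's reset/step return conventions with hashable observations, epsilon decays linearly per episode, and Python lists replace NumPy arrays to avoid dependencies.

```python
import random
from collections import defaultdict


def td_update(q_values, obs, action, reward, terminated, next_obs, lr, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    # No bootstrapping from the next state once the episode has terminated.
    future_q = 0.0 if terminated else max(q_values[next_obs])
    td_error = reward + gamma * future_q - q_values[obs][action]
    q_values[obs][action] += lr * td_error
    return td_error


def train(env, n_actions, n_episodes, lr, gamma,
          start_epsilon, epsilon_decay, final_epsilon, seed=None):
    """Run tabular Q-learning; return the Q-table and per-step TD errors."""
    rng = random.Random(seed)
    q_values = defaultdict(lambda: [0.0] * n_actions)
    epsilon = start_epsilon
    td_errors = []
    for _ in range(n_episodes):
        obs, info = env.reset()
        done = False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)  # explore
            else:
                row = q_values[obs]
                action = max(range(n_actions), key=row.__getitem__)  # exploit
            next_obs, reward, terminated, truncated, info = env.step(action)
            td_errors.append(
                td_update(q_values, obs, action, reward, terminated,
                          next_obs, lr, gamma)
            )
            done = terminated or truncated
            obs = next_obs
        # Shift gradually from exploration to exploitation.
        epsilon = max(final_epsilon, epsilon - epsilon_decay)
    return q_values, td_errors
```

Returning the TD errors from the update keeps the monitoring signal available without storing extra state inside the update function.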
Step 4: Training Progress Visualization
After training completes, plot the learning curves. Episode rewards and lengths come from the statistics collected by the RecordEpisodeStatistics wrapper; temporal difference errors must be recorded by the agent during training, since the wrapper does not track them. Generate moving-average plots for all three, using a rolling window (e.g., 500 episodes) to smooth the noisy per-episode data and reveal overall trends.
Key considerations:
- Episode rewards should show gradual improvement over training
- Decreasing temporal difference errors indicate the Q-values are stabilizing
- Episode length changes may indicate the agent is learning different strategies
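The smoothing step can be illustrated with a small helper. The function name is hypothetical, and this pure-Python version is only a stand-in for whatever convolution or rolling-mean routine the plotting code actually uses; it returns one average per full window.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values (full windows only)."""
    values = list(values)
    if window <= 0 or window > len(values):
        return []
    total = sum(values[:window])
    averages = [total / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        total += values[i] - values[i - window]
        averages.append(total / window)
    return averages
```

With the wrapped environment from Step 1, `moving_average(env.return_queue, 500)` and `moving_average(env.length_queue, 500)` would give the smoothed reward and length curves, and the same helper can be applied to the recorded temporal difference errors before passing the results to a plotting library.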
Step 5: Agent Evaluation
Test the trained agent by running evaluation episodes with exploration disabled (epsilon set to zero). Collect statistics on win rate, average reward, and consistency (standard deviation) to assess the quality of the learned policy. Compare against known optimal performance for the environment if available.
Key considerations:
- Disable exploration during evaluation to test the pure learned policy
- Run sufficient evaluation episodes (e.g., 1000+) for statistically meaningful results
- Restore the original epsilon after evaluation if further training is planned
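An evaluation routine along these lines is sketched below. It assumes a hypothetical agent object exposing an epsilon attribute and a get_action(obs) method, an env following Gymnasium's reset/step conventions, and it counts an episode as a win when its total reward is positive, which fits Blackjack but may not fit other environments.

```python
import statistics


def evaluate(env, agent, n_episodes=1000):
    """Run greedy episodes and summarize returns; restores agent.epsilon."""
    saved_epsilon = agent.epsilon
    agent.epsilon = 0.0  # disable exploration for the evaluation run
    returns = []
    try:
        for _ in range(n_episodes):
            obs, info = env.reset()
            done = False
            total = 0.0
            while not done:
                action = agent.get_action(obs)
                obs, reward, terminated, truncated, info = env.step(action)
                total += reward
                done = terminated or truncated
            returns.append(total)
    finally:
        # Put the exploration rate back even if an episode raised an error.
        agent.epsilon = saved_epsilon
    return {
        "win_rate": sum(r > 0 for r in returns) / len(returns),
        "mean_reward": statistics.fmean(returns),
        "std_reward": statistics.pstdev(returns),
    }
```

Restoring epsilon in a finally block means training can resume with the previous exploration rate regardless of how evaluation ends.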