Workflow:Farama Foundation Gymnasium RL Agent Training Loop

From Leeroopedia
Knowledge Sources
Domains: Reinforcement_Learning, Agent_Training, Tabular_RL
Last Updated: 2026-02-15 03:00 GMT

Overview

End-to-end process for training a tabular reinforcement learning agent on a Gymnasium environment using the Q-learning algorithm.

Description

This workflow covers the standard procedure for training an RL agent from scratch using Gymnasium's environment API. It demonstrates the fundamental agent-environment interaction loop where the agent observes the environment state, selects an action (using an epsilon-greedy exploration strategy), receives a reward, and updates its Q-value estimates via the Bellman equation. The workflow covers environment initialization with gymnasium.make, the core training loop with env.step and env.reset, Q-table management using defaultdict, epsilon-greedy action selection, temporal difference learning updates, and training progress visualization. The primary example uses the Blackjack-v1 environment with tabular Q-learning, but the pattern applies to any discrete-state Gymnasium environment.

Usage

Execute this workflow when you have a discrete-state Gymnasium environment and want to train a tabular Q-learning agent. This is the foundational RL training pattern suitable for environments with manageable state spaces (such as Blackjack, FrozenLake, Taxi, or CliffWalking). Use this workflow when you need a baseline agent, want to understand Q-learning fundamentals, or are prototyping before moving to function approximation methods.

Execution Steps

Step 1: Environment Setup

Initialize the Gymnasium environment using the make factory function with the desired environment ID. Wrap the environment with RecordEpisodeStatistics to automatically track episode rewards, lengths, and timing data. Configure training hyperparameters including learning rate, number of episodes, epsilon schedule (start value, decay rate, and minimum floor), and discount factor.

Key considerations:

  • Choose an environment with a discrete state and action space for tabular methods
  • The RecordEpisodeStatistics wrapper must be applied before the training loop begins
  • Set buffer_length to match the total number of training episodes for complete tracking; the wrapper's default of 100 retains only the most recent 100 episodes
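
The setup described above can be sketched as follows. This is a minimal illustration, not a reference implementation: TrainingConfig and make_env are hypothetical names, the hyperparameter values are placeholders to tune per environment, and the linear decay that reaches the epsilon floor halfway through training is an assumption. The buffer_length keyword follows the wording above; older Gymnasium releases name this argument deque_size.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    """Hypothetical container for the hyperparameters named in this step."""

    learning_rate: float = 0.01
    n_episodes: int = 100_000
    start_epsilon: float = 1.0
    final_epsilon: float = 0.1
    discount_factor: float = 0.95

    @property
    def epsilon_decay(self) -> float:
        # Assumption: linear decay that hits the floor halfway through training.
        return (self.start_epsilon - self.final_epsilon) / (self.n_episodes / 2)


def make_env(config: TrainingConfig, env_id: str = "Blackjack-v1"):
    """Create the environment and wrap it so every episode's statistics are kept."""
    import gymnasium as gym  # imported lazily so the config works without Gymnasium

    env = gym.make(env_id)
    # buffer_length sized to the run so no episode is dropped from the queues.
    return gym.wrappers.RecordEpisodeStatistics(env, buffer_length=config.n_episodes)
```

Calling make_env(TrainingConfig()) returns the wrapped environment that the later steps consume.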

Step 2: Agent Initialization

Create the Q-learning agent with a Q-table (a dictionary mapping each state to a vector of per-action value estimates), learning rate, discount factor, and epsilon-greedy exploration parameters. The Q-table is typically initialized as a defaultdict that returns a zero vector for unseen states, allowing lazy initialization as new states are encountered during training.

Key considerations:

  • Use defaultdict with a lambda returning np.zeros(action_space.n) for automatic Q-table expansion
  • The initial epsilon should be high (typically 1.0) for maximum exploration at the start
  • Discount factor (gamma) controls how much the agent values future rewards versus immediate rewards
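
A minimal sketch of such an agent is shown below. The class name QLearningAgent, its constructor arguments, and the seeded NumPy random generator are illustrative assumptions; only the defaultdict-of-zero-vectors layout and the epsilon-greedy rule come from the description above.

```python
from collections import defaultdict

import numpy as np


class QLearningAgent:
    """Hypothetical tabular agent: state -> vector of action values."""

    def __init__(self, n_actions, learning_rate, initial_epsilon,
                 epsilon_decay, final_epsilon, discount_factor=0.95, seed=None):
        self.n_actions = n_actions
        # Unseen states lazily get a zero vector with one entry per action.
        self.q_values = defaultdict(lambda: np.zeros(n_actions))
        self.lr = learning_rate
        self.discount_factor = discount_factor
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon
        self.rng = np.random.default_rng(seed)
        self.training_error = []  # filled in by the update step during training

    def get_action(self, obs):
        """Epsilon-greedy: random action with probability epsilon, else greedy."""
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.n_actions))
        return int(np.argmax(self.q_values[obs]))

    def decay_epsilon(self):
        """Linearly reduce epsilon, never going below the configured floor."""
        self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
```

For a Gymnasium environment with a Discrete action space, n_actions would typically be env.action_space.n. Observations must be hashable (Blackjack-v1 returns tuples) to serve as dictionary keys.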

Step 3: Training Loop Execution

For each episode: reset the environment to get the initial observation, then loop until the episode terminates or is truncated. At each timestep, select an action using the epsilon-greedy strategy (random action with probability epsilon, best known action otherwise), execute the action via env.step, observe the reward and next state, and update the Q-value using the temporal difference update rule. After each episode, decay epsilon to gradually shift from exploration to exploitation.

Key considerations:

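  • The Q-value update is the Q-learning temporal difference rule derived from the Bellman optimality equation: Q(s,a) += lr * (reward + gamma * max_a' Q(s',a') - Q(s,a))
  • When the episode terminates, the future Q-value is zero (no future rewards possible); when it is only truncated (for example by a time limit), the next state's value should still be bootstrapped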
  • Epsilon decay should be calibrated so the agent explores sufficiently before converging
  • Track the temporal difference error for monitoring learning progress
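
The loop and update rule above can be sketched as plain functions. The names q_learning_update, epsilon_greedy, and train are hypothetical, the default hyperparameters are placeholders, and the linear epsilon schedule is an assumption. The env argument only needs the reset and step methods with Gymnasium's return signatures.

```python
from collections import defaultdict

import numpy as np


def q_learning_update(q_values, obs, action, reward, terminated, next_obs,
                      lr=0.01, gamma=0.95):
    """Apply one temporal difference update in place and return the TD error."""
    # No bootstrapping after true termination; truncation still bootstraps.
    future_q = 0.0 if terminated else float(np.max(q_values[next_obs]))
    td_error = reward + gamma * future_q - q_values[obs][action]
    q_values[obs][action] += lr * td_error
    return td_error


def epsilon_greedy(q_values, obs, epsilon, n_actions, rng):
    """Random action with probability epsilon, otherwise the best known action."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_values[obs]))


def train(env, n_episodes, n_actions, lr=0.01, gamma=0.95,
          start_epsilon=1.0, final_epsilon=0.1, seed=0):
    """Run tabular Q-learning and return the Q-table plus per-step TD errors."""
    rng = np.random.default_rng(seed)
    q_values = defaultdict(lambda: np.zeros(n_actions))
    epsilon = start_epsilon
    # Assumption: linear decay reaching the floor halfway through training.
    epsilon_decay = (start_epsilon - final_epsilon) / max(1, n_episodes // 2)
    td_errors = []
    for _ in range(n_episodes):
        obs, info = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(q_values, obs, epsilon, n_actions, rng)
            next_obs, reward, terminated, truncated, info = env.step(action)
            td_errors.append(q_learning_update(
                q_values, obs, action, reward, terminated, next_obs, lr, gamma))
            done = terminated or truncated
            obs = next_obs
        epsilon = max(final_epsilon, epsilon - epsilon_decay)
    return q_values, td_errors
```

With a wrapped Blackjack-v1 environment, a call such as train(env, 100_000, env.action_space.n) would run the full loop. Passing terminated rather than done into the update keeps bootstrapping active when an episode is cut off by a time limit.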

Step 4: Training Progress Visualization

After training completes, plot the learning curves. Episode rewards and lengths come from the statistics collected by the RecordEpisodeStatistics wrapper (its return_queue and length_queue attributes); temporal difference errors must be recorded by the agent itself during updates. Generate moving average plots for all three series, using a rolling window (e.g., 500 episodes) to smooth the noisy per-episode data and reveal overall trends.

Key considerations:

  • Episode rewards should show gradual improvement over training
  • Decreasing temporal difference errors indicate the Q-values are stabilizing
  • A shift in average episode length suggests the policy's behavior is changing (in Blackjack, for example, the agent may be hitting more or less often)
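
One way to produce these smoothed curves is sketched below. The helpers moving_average and plot_training are hypothetical names, and the three-panel layout is a design choice rather than something this workflow prescribes. Matplotlib is assumed to be installed for the plotting function only.

```python
import numpy as np


def moving_average(values, window):
    """Rolling mean in 'valid' mode; expects at least `window` data points."""
    values = np.asarray(values, dtype=float).ravel()
    return np.convolve(values, np.ones(window), mode="valid") / window


def plot_training(returns, lengths, td_errors, window=500):
    """Plot smoothed rewards, lengths, and TD errors side by side."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, axes = plt.subplots(ncols=3, figsize=(12, 4))
    series = [
        ("Episode rewards", returns),
        ("Episode lengths", lengths),
        ("Training error", td_errors),
    ]
    for ax, (title, data) in zip(axes, series):
        smoothed = moving_average(data, window)
        ax.plot(range(len(smoothed)), smoothed)
        ax.set_title(title)
    fig.tight_layout()
    return fig
```

Given the wrapped environment and a list of recorded TD errors, a call such as plot_training(env.return_queue, env.length_queue, td_errors) would draw all three curves. TD errors are recorded per step, not per episode, so the same window covers fewer episodes in that panel.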

Step 5: Agent Evaluation

Test the trained agent by running evaluation episodes with exploration disabled (epsilon set to zero). Collect statistics on win rate, average reward, and consistency (standard deviation) to assess the quality of the learned policy. Compare against known optimal performance for the environment if available.

Key considerations:

  • Disable exploration during evaluation to test the pure learned policy
  • Run sufficient evaluation episodes (e.g., 1000+) for statistically meaningful results
  • Restore the original epsilon after evaluation if further training is planned
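
The evaluation procedure can be sketched as below. The function name evaluate is hypothetical, and it assumes an agent object exposing an epsilon attribute and a get_action method. Counting an episode as a win when its total reward is positive is an assumption that fits Blackjack's reward scheme and may need changing for other environments.

```python
import numpy as np


def evaluate(agent, env, n_episodes=1000):
    """Run greedy episodes and summarize the total reward per episode."""
    original_epsilon = agent.epsilon
    agent.epsilon = 0.0  # disable exploration so only the learned policy acts
    totals = []
    try:
        for _ in range(n_episodes):
            obs, info = env.reset()
            total, done = 0.0, False
            while not done:
                action = agent.get_action(obs)
                obs, reward, terminated, truncated, info = env.step(action)
                total += float(reward)
                done = terminated or truncated
            totals.append(total)
    finally:
        # Restore exploration even if an episode raised, so training can resume.
        agent.epsilon = original_epsilon
    totals = np.asarray(totals)
    return {
        # Assumption: a positive episode total counts as a win.
        "win_rate": float(np.mean(totals > 0)),
        "mean_reward": float(np.mean(totals)),
        "std_reward": float(np.std(totals)),
    }
```

The try/finally block guarantees the original epsilon is restored, which matters when evaluation is interleaved with further training.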
