Implementation: Farama Foundation Gymnasium Q-Table Agent
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Value_Based_Methods |
| Last Updated | 2026-02-15 03:00 GMT |
Overview
User-defined agent pattern implementing tabular Q-Learning with Gymnasium's discrete environment interface.
Description
The Q-Table Agent is a Pattern Doc — a standard RL agent pattern implemented by users on top of Gymnasium environments. It uses a defaultdict to store Q-values, epsilon-greedy for exploration, and the standard Q-Learning TD update. The agent interacts with environments via env.step() and env.reset() and uses env.action_space.n for action space size.
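The TD update described above can be shown as a standalone step. This is a minimal sketch, not code from the source; the observation tuples, reward, and the 2-action table size are illustrative values.

```python
import numpy as np
from collections import defaultdict

# Q-table: each unseen state maps to a zero vector over actions (2 here, illustrative).
q_values = defaultdict(lambda: np.zeros(2))

learning_rate, discount_factor = 0.01, 0.95
# Hypothetical transition: (obs, action, reward, terminated, next_obs).
obs, action, reward, terminated, next_obs = (14, 6, 0), 1, 1.0, True, (20, 6, 0)

# Q-Learning TD update: bootstrap from the next state only if the episode continues.
future_q = 0.0 if terminated else np.max(q_values[next_obs])
td_target = reward + discount_factor * future_q
td_error = td_target - q_values[obs][action]
q_values[obs][action] += learning_rate * td_error

print(q_values[obs][action])  # 0.01 * (1.0 - 0.0) = 0.01
```

Because `terminated` is `True` here, the target reduces to the raw reward and the entry moves one `learning_rate`-sized step toward it.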
Usage
Implement this pattern for tabular RL tasks with discrete state and action spaces. Access env.action_space.n for Q-table initialization and use hashable observations (tuples) as dictionary keys.
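The keying convention can be sketched as follows; the Blackjack-style observation values and the 2-action count (standing in for `env.action_space.n`) are illustrative assumptions, not from the source.

```python
import numpy as np
from collections import defaultdict

n_actions = 2  # on a real env this would be env.action_space.n

# defaultdict lazily creates a zero row for any state on first access.
q_values = defaultdict(lambda: np.zeros(n_actions))

# Blackjack-style observation (player_sum, dealer_card, usable_ace):
# a hashable tuple, so it works directly as a dictionary key.
obs = (17, 10, 0)
q_values[obs][1] = 0.5

print(q_values[obs])   # second action now has value 0.5
print(len(q_values))   # only states actually visited are stored
```

Lazily populating the table this way avoids enumerating the full state space up front, which is why tuple (hashable) observations are required.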
Code Reference
Source Location
- Repository: User-implemented pattern (based on Gymnasium tutorial)
- Reference: Blackjack Tutorial
Signature
class QLearningAgent:
    def __init__(
        self,
        env: gym.Env,
        learning_rate: float = 0.01,
        initial_epsilon: float = 1.0,
        epsilon_decay: float = 0.001,
        final_epsilon: float = 0.1,
        discount_factor: float = 0.95,
    ):
        """Tabular Q-Learning agent.

        Args:
            env: Gymnasium environment with Discrete action space.
            learning_rate: Step size for Q-value updates.
            initial_epsilon: Starting exploration rate.
            epsilon_decay: Per-episode epsilon reduction.
            final_epsilon: Minimum exploration rate.
            discount_factor: Gamma for future reward discounting.
        """
Import
# User-defined agent, no library import beyond standard tools
import numpy as np
from collections import defaultdict
import gymnasium as gym
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| env | gym.Env | Yes | Environment with Discrete action space |
| learning_rate | float | No | Q-value update step size (default 0.01) |
| initial_epsilon | float | No | Starting exploration rate (default 1.0) |
| epsilon_decay | float | No | Per-episode epsilon reduction (default 0.001) |
| final_epsilon | float | No | Minimum exploration rate (default 0.1) |
| discount_factor | float | No | Gamma (default 0.95) |
Outputs
| Name | Type | Description |
|---|---|---|
| q_values | defaultdict | State-action value table |
| get_action(obs) | int | Epsilon-greedy action selection |
Usage Examples
Blackjack Q-Learning Agent
import numpy as np
from collections import defaultdict
import gymnasium as gym

class QLearningAgent:
    def __init__(
        self,
        env,
        learning_rate=0.01,
        initial_epsilon=1.0,
        epsilon_decay=0.001,
        final_epsilon=0.1,
        discount_factor=0.95,
    ):
        self.env = env
        # Lazily initialized Q-table: unseen states map to zero action values.
        self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon

    def get_action(self, obs):
        # Explore with probability epsilon, otherwise act greedily.
        if np.random.random() < self.epsilon:
            return self.env.action_space.sample()
        return int(np.argmax(self.q_values[obs]))

    def update(self, obs, action, reward, terminated, next_obs):
        # Q-Learning TD update; bootstrap only from non-terminal next states.
        future_q = 0 if terminated else np.max(self.q_values[next_obs])
        td_target = reward + self.gamma * future_q
        td_error = td_target - self.q_values[obs][action]
        self.q_values[obs][action] += self.lr * td_error

    def decay_epsilon(self):
        # Linear decay, floored at final_epsilon.
        self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)

# Training
env = gym.make("Blackjack-v1", sab=True)
env = gym.wrappers.RecordEpisodeStatistics(env, buffer_length=100)
agent = QLearningAgent(env)

for episode in range(100_000):
    obs, info = env.reset()
    terminated, truncated = False, False
    while not (terminated or truncated):
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        agent.update(obs, action, reward, terminated, next_obs)
        obs = next_obs
    agent.decay_epsilon()

env.close()