Implementation: Farama Foundation Gymnasium Q-Table Agent
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Value_Based_Methods |
| Last Updated | 2026-02-15 03:00 GMT |
Overview
User-defined agent pattern implementing tabular Q-Learning with Gymnasium's discrete environment interface.
Description
The Q-Table Agent is a Pattern Doc — a standard RL agent pattern implemented by users on top of Gymnasium environments. It uses a defaultdict to store Q-values, epsilon-greedy for exploration, and the standard Q-Learning TD update. The agent interacts with environments via env.step() and env.reset() and uses env.action_space.n for action space size.
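The TD update described above can be shown as a standalone step. This is a minimal sketch, not code from the source; the observation tuples, reward, and the 2-action table size are illustrative values.

```python
import numpy as np
from collections import defaultdict

# Q-table: each unseen state maps to a zero vector over actions (2 here, illustrative).
q_values = defaultdict(lambda: np.zeros(2))

learning_rate, discount_factor = 0.01, 0.95
# Hypothetical transition: (obs, action, reward, terminated, next_obs).
obs, action, reward, terminated, next_obs = (14, 6, 0), 1, 1.0, True, (20, 6, 0)

# Q-Learning TD update: bootstrap from the next state only if the episode continues.
future_q = 0.0 if terminated else np.max(q_values[next_obs])
td_target = reward + discount_factor * future_q
td_error = td_target - q_values[obs][action]
q_values[obs][action] += learning_rate * td_error

print(q_values[obs][action])  # 0.01 * (1.0 - 0.0) = 0.01
```

Because `terminated` is `True` here, the target reduces to the raw reward and the entry moves one `learning_rate`-sized step toward it.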
Usage
Implement this pattern for tabular RL tasks with discrete state and action spaces. Access env.action_space.n for Q-table initialization and use hashable observations (tuples) as dictionary keys.
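The keying convention can be sketched as follows; the Blackjack-style observation values and the 2-action count (standing in for `env.action_space.n`) are illustrative assumptions, not from the source.

```python
import numpy as np
from collections import defaultdict

n_actions = 2  # on a real env this would be env.action_space.n

# defaultdict lazily creates a zero row for any state on first access.
q_values = defaultdict(lambda: np.zeros(n_actions))

# Blackjack-style observation (player_sum, dealer_card, usable_ace):
# a hashable tuple, so it works directly as a dictionary key.
obs = (17, 10, 0)
q_values[obs][1] = 0.5

print(q_values[obs])   # second action now has value 0.5
print(len(q_values))   # only states actually visited are stored
```

Lazily populating the table this way avoids enumerating the full state space up front, which is why tuple (hashable) observations are required.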
Code Reference
Source Location
- Repository: User-implemented pattern (based on Gymnasium tutorial)
- Reference: Blackjack Tutorial
Signature
class QLearningAgent:
    def __init__(
        self,
        env: gym.Env,
        learning_rate: float = 0.01,
        initial_epsilon: float = 1.0,
        epsilon_decay: float = 0.001,
        final_epsilon: float = 0.1,
        discount_factor: float = 0.95,
    ):
        """Tabular Q-Learning agent.

        Args:
            env: Gymnasium environment with Discrete action space.
            learning_rate: Step size for Q-value updates.
            initial_epsilon: Starting exploration rate.
            epsilon_decay: Per-episode epsilon reduction.
            final_epsilon: Minimum exploration rate.
            discount_factor: Gamma for future reward discounting.
        """
Import
# User-defined agent, no library import beyond standard tools
import numpy as np
from collections import defaultdict
import gymnasium as gym
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| env | gym.Env | Yes | Environment with Discrete action space |
| learning_rate | float | No | Q-value update step size (default 0.01) |
| initial_epsilon | float | No | Starting exploration rate (default 1.0) |
| epsilon_decay | float | No | Per-episode epsilon reduction (default 0.001) |
| final_epsilon | float | No | Minimum exploration rate (default 0.1) |
| discount_factor | float | No | Gamma (default 0.95) |
Outputs
| Name | Type | Description |
|---|---|---|
| q_values | defaultdict | State-action value table |
| get_action(obs) | int | Epsilon-greedy action selection |
Usage Examples
Blackjack Q-Learning Agent
import numpy as np
from collections import defaultdict
import gymnasium as gym

class QLearningAgent:
    def __init__(
        self,
        env,
        learning_rate=0.01,
        initial_epsilon=1.0,
        epsilon_decay=0.001,
        final_epsilon=0.1,
        discount_factor=0.95,
    ):
        self.env = env
        # Lazily initialized Q-table: unseen states map to zero action values.
        self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon

    def get_action(self, obs):
        # Explore with probability epsilon, otherwise act greedily.
        if np.random.random() < self.epsilon:
            return self.env.action_space.sample()
        return int(np.argmax(self.q_values[obs]))

    def update(self, obs, action, reward, terminated, next_obs):
        # Q-Learning TD update; bootstrap only from non-terminal next states.
        future_q = 0 if terminated else np.max(self.q_values[next_obs])
        td_target = reward + self.gamma * future_q
        td_error = td_target - self.q_values[obs][action]
        self.q_values[obs][action] += self.lr * td_error

    def decay_epsilon(self):
        # Linear decay, floored at final_epsilon.
        self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)

# Training
env = gym.make("Blackjack-v1", sab=True)
env = gym.wrappers.RecordEpisodeStatistics(env, buffer_length=100)
agent = QLearningAgent(env)

for episode in range(100_000):
    obs, info = env.reset()
    terminated, truncated = False, False
    while not (terminated or truncated):
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        agent.update(obs, action, reward, terminated, next_obs)
        obs = next_obs
    agent.decay_epsilon()

env.close()