Principle:Farama Foundation Gymnasium Q Learning Tabular

From Leeroopedia
Knowledge Sources
Domains Reinforcement_Learning, Value_Based_Methods
Last Updated 2026-02-15 03:00 GMT

Overview

A model-free temporal-difference learning algorithm that estimates the optimal action-value function using a lookup table and an epsilon-greedy exploration strategy.

Description

Tabular Q-Learning maintains a table mapping state-action pairs to estimated Q-values. At each step, the agent:

  1. Observes state s
  2. Selects action via epsilon-greedy: random with probability ϵ, greedy otherwise
  3. Receives reward r and next state s′
  4. Updates Q(s,a) using the TD update rule
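
The loop above can be sketched in plain Python. The two-state chain environment here is a hypothetical stand-in with a simplified reset()/step() interface in the spirit of Gymnasium's API; all names are illustrative, not from any library.

```python
import random
from collections import defaultdict

class ChainEnv:
    """Toy 2-state chain. Action 1 in state 0 moves to terminal state 1
    with reward 1; anything else stays in state 0 with reward 0."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if self.state == 0 and action == 1:
            self.state = 1
            return self.state, 1.0, True   # next_state, reward, terminated
        return 0, 0.0, False

def train(env, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0, 0.0])    # state -> Q-value per action
    for _ in range(episodes):
        s = env.reset()                    # 1. observe state s
        done, steps = False, 0
        while not done and steps < 20:
            # 2. epsilon-greedy: random with prob epsilon, greedy otherwise
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max(range(2), key=lambda i: Q[s][i])
            # 3. receive reward r and next state s'
            s_next, r, done = env.step(a)
            # 4. TD update toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
            steps += 1
    return Q

Q = train(ChainEnv())
```

After training, Q[0][1] approaches the true optimal value 1.0 and Q[0][0] approaches the discounted value 0.9 of delaying by one step.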

Q-Learning is off-policy: the update uses max_a′ Q(s′, a′) regardless of the action actually taken. This enables learning the optimal policy while exploring.

Requirements for tabular Q-Learning:

  • Discrete state space: States must be hashable (tuples, integers)
  • Discrete action space: Typically Discrete(n) in Gymnasium
  • Sufficient exploration: Epsilon-greedy with decaying epsilon
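
The hashable-state requirement is commonly satisfied with a defaultdict keyed by state tuples, so unseen states are initialized lazily. A minimal sketch (the observation tuple below follows the shape of Gymnasium's Blackjack observations):

```python
from collections import defaultdict

n_actions = 2  # e.g. env.action_space.n for a Discrete(2) action space

# Q-table: any hashable state (tuple, int) maps lazily to a zero row
Q = defaultdict(lambda: [0.0] * n_actions)

state = (14, 6, False)   # e.g. (player_sum, dealer_card, usable_ace)
Q[state][0] = 0.25       # update the Q-value of action 0 in that state

print(Q[state])          # [0.25, 0.0]
print(Q[(20, 10, True)]) # unseen state -> [0.0, 0.0]
```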

Usage

Use tabular Q-Learning for environments with small, discrete state and action spaces (e.g., Blackjack, FrozenLake, Taxi). For continuous or high-dimensional states, use Deep Q-Networks (DQN) instead.

Theoretical Basis

The Q-Learning update rule:

  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a′ Q(s_{t+1}, a′) − Q(s_t, a_t) ]

Where:

  • α: Learning rate
  • γ: Discount factor
  • max_a′ Q(s′, a′): Off-policy bootstrap target
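
A worked instance of the update (the function name is illustrative):

```python
def td_update(q_sa, reward, max_q_next, alpha, gamma):
    """One Q-Learning step: Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)

# With Q(s,a)=0, r=1, gamma=0.9, max_a' Q(s',a')=2, alpha=0.1:
# target = 1 + 0.9*2 = 2.8, so Q(s,a) <- 0 + 0.1*(2.8 - 0) = 0.28
new_q = td_update(0.0, 1.0, 2.0, alpha=0.1, gamma=0.9)
```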

Convergence is guaranteed under the Robbins-Monro conditions: every state-action pair is visited infinitely often, and the learning rates satisfy Σ_t α_t = ∞ and Σ_t α_t² < ∞ (e.g., α_t = 1/t).

Epsilon-greedy policy:

  π(a|s) = 1 − ϵ + ϵ/|A|   if a = argmax_a′ Q(s, a′)
           ϵ/|A|            otherwise
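
The two cases above assign a total probability of 1 over the action set: each of the |A| actions gets ϵ/|A|, and the greedy action gets an extra 1 − ϵ. A quick sketch (function name is illustrative):

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|s) for each action under an epsilon-greedy policy."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n        # every action gets epsilon/|A|
    probs[greedy] += 1.0 - epsilon   # greedy action gets the extra 1 - epsilon
    return probs

probs = epsilon_greedy_probs([0.1, 0.5, 0.2], epsilon=0.3)
# greedy action is index 1: pi = 1 - 0.3 + 0.3/3 = 0.8; the others get 0.3/3 = 0.1
```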
