# Principle: Tabular Q-Learning with Farama Foundation Gymnasium
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Value_Based_Methods |
| Last Updated | 2026-02-15 03:00 GMT |
## Overview
A model-free temporal-difference learning algorithm that estimates the optimal action-value function using a lookup table and an epsilon-greedy exploration strategy.
## Description
Tabular Q-Learning maintains a table mapping state-action pairs to estimated Q-values. At each step, the agent:
- Observes the current state $s$
- Selects an action $a$ via epsilon-greedy: random with probability $\epsilon$, greedy otherwise
- Receives reward $r$ and next state $s'$
- Updates $Q(s, a)$ using the TD update rule

Q-Learning is off-policy: the update bootstraps from $\max_{a'} Q(s', a')$ regardless of the action actually taken. This enables learning the optimal (greedy) policy while following an exploratory behavior policy.
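The update step above can be sketched as follows; the tiny 2-state, 2-action table and the hyperparameter values are illustrative, not from the source:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: bootstrap from the max over next actions."""
    td_target = r + gamma * np.max(Q[s_next])   # greedy bootstrap, independent of the action taken
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: 2 states, 2 actions, all Q-values start at zero.
Q = np.zeros((2, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)      # only Q[0, 1] moves, by alpha * (1.0 - 0)
```

Note that only the visited pair `(s, a)` changes; all other table entries are untouched, which is what makes the method "tabular".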
Requirements for tabular Q-Learning:
- Discrete state space: States must be hashable (tuples, integers)
- Discrete action space: Typically Discrete(n) in Gymnasium
- Sufficient exploration: Epsilon-greedy with decaying epsilon
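One common way to satisfy the hashable-state requirement without enumerating the state space up front is a `defaultdict` keyed by observation tuples. The Blackjack-style observation and action count below are illustrative assumptions:

```python
from collections import defaultdict
import numpy as np

n_actions = 2  # e.g. Blackjack: stick or hit (assumed for illustration)

# Any hashable observation works as a key; unseen states get zero-initialized Q-values.
Q = defaultdict(lambda: np.zeros(n_actions))

obs = (14, 10, False)  # illustrative tuple: (player sum, dealer card, usable ace)
values = Q[obs]        # first access creates a fresh zero vector
```

This avoids allocating a dense table when the reachable state space is sparse.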
## Usage
Use tabular Q-Learning for environments with small, discrete state and action spaces (e.g., Blackjack, FrozenLake, Taxi). For continuous or high-dimensional states, use Deep Q-Networks (DQN) instead.
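A minimal end-to-end sketch. A hand-rolled 4-state corridor stands in for a Gymnasium environment here so the snippet runs without external dependencies; it follows the same `reset()`/`step()` return contract (`obs, info` and `obs, reward, terminated, truncated, info`). All hyperparameters are illustrative:

```python
import numpy as np

class Corridor:
    """Stand-in environment: start at state 0, reward 1.0 for reaching state 3."""
    def __init__(self):
        self.n_states, self.n_actions = 4, 2  # actions: 0 = left, 1 = right
    def reset(self, seed=None):
        self.s = 0
        return self.s, {}
    def step(self, a):
        self.s = max(0, self.s - 1) if a == 0 else self.s + 1
        terminated = self.s == self.n_states - 1
        return self.s, float(terminated), terminated, False, {}

def train(env, episodes=500, alpha=0.1, gamma=0.99,
          eps=1.0, eps_decay=0.995, eps_min=0.05, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy with decaying epsilon
            a = int(rng.integers(env.n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # zero the bootstrap at terminal states
            target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
        eps = max(eps_min, eps * eps_decay)
    return Q

Q = train(Corridor())
```

Swapping in a real discrete Gymnasium environment such as FrozenLake requires only replacing `Corridor()` and reading the state/action counts from its observation and action spaces.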
## Theoretical Basis
The Q-Learning update rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Where:
- $\alpha$: Learning rate
- $\gamma$: Discount factor
- $r + \gamma \max_{a'} Q(s', a')$: Off-policy bootstrap target
Convergence is guaranteed under the Robbins-Monro conditions: every state-action pair is visited infinitely often, and the learning rate decays appropriately.
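Written out per state-action pair, the step-size conditions are:

```latex
\sum_{t} \alpha_t(s, a) = \infty, \qquad \sum_{t} \alpha_t^2(s, a) < \infty
```

For example, $\alpha_t = 1/t$ satisfies both, while a constant $\alpha$ satisfies only the first (and so gives convergence in expectation to a neighborhood rather than exact convergence).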
Epsilon-greedy policy:

$$\pi(a \mid s) = \begin{cases} 1 - \epsilon + \dfrac{\epsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \dfrac{\epsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$
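A minimal sketch of epsilon-greedy action selection (the Q-values below are illustrative):

```python
import numpy as np

def epsilon_greedy(Q_row, eps, rng):
    """Sample an action: uniform-random with probability eps, greedy otherwise."""
    if rng.random() < eps:
        return int(rng.integers(len(Q_row)))
    return int(np.argmax(Q_row))

rng = np.random.default_rng(0)
q = np.array([0.1, 0.9, 0.3])
greedy = epsilon_greedy(q, eps=0.0, rng=rng)  # eps=0 always picks the argmax
```

Note the "random" branch can also pick the greedy action, which is why the greedy action's total probability in the formula above is $1 - \epsilon + \epsilon / |\mathcal{A}|$.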