Principle:Farama Foundation Gymnasium Q Learning Tabular

From Leeroopedia
Knowledge Sources
Domains Reinforcement_Learning, Value_Based_Methods
Last Updated 2026-02-15 03:00 GMT

Overview

A model-free temporal-difference learning algorithm that estimates the optimal action-value function using a lookup table and an epsilon-greedy exploration strategy.

Description

Tabular Q-Learning maintains a table mapping state-action pairs to estimated Q-values. At each step, the agent:

  1. Observes state s
  2. Selects action via epsilon-greedy: random with probability ϵ, greedy otherwise
  3. Receives reward r and next state s′
  4. Updates Q(s,a) using the TD update rule
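
The loop above can be sketched in plain Python. The two-state chain environment here is a hypothetical stand-in with a simplified reset()/step() interface in the spirit of Gymnasium's API; all names are illustrative, not from any library.

```python
import random
from collections import defaultdict

class ChainEnv:
    """Toy 2-state chain. Action 1 in state 0 moves to terminal state 1
    with reward 1; anything else stays in state 0 with reward 0."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if self.state == 0 and action == 1:
            self.state = 1
            return self.state, 1.0, True   # next_state, reward, terminated
        return 0, 0.0, False

def train(env, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0, 0.0])    # state -> Q-value per action
    for _ in range(episodes):
        s = env.reset()                    # 1. observe state s
        done, steps = False, 0
        while not done and steps < 20:
            # 2. epsilon-greedy: random with prob epsilon, greedy otherwise
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max(range(2), key=lambda i: Q[s][i])
            # 3. receive reward r and next state s'
            s_next, r, done = env.step(a)
            # 4. TD update toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
            steps += 1
    return Q

Q = train(ChainEnv())
```

After training, Q[0][1] approaches the true optimal value 1.0 and Q[0][0] approaches the discounted value 0.9 of delaying by one step.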

Q-Learning is off-policy: the update uses max_a′ Q(s′, a′) regardless of the action actually taken. This enables learning the optimal policy while exploring.

Requirements for tabular Q-Learning:

  • Discrete state space: States must be hashable (tuples, integers)
  • Discrete action space: Typically Discrete(n) in Gymnasium
  • Sufficient exploration: Epsilon-greedy with decaying epsilon
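
The hashable-state requirement is commonly satisfied with a defaultdict keyed by state tuples, so unseen states are initialized lazily. A minimal sketch (the observation tuple below follows the shape of Gymnasium's Blackjack observations):

```python
from collections import defaultdict

n_actions = 2  # e.g. env.action_space.n for a Discrete(2) action space

# Q-table: any hashable state (tuple, int) maps lazily to a zero row
Q = defaultdict(lambda: [0.0] * n_actions)

state = (14, 6, False)   # e.g. (player_sum, dealer_card, usable_ace)
Q[state][0] = 0.25       # update the Q-value of action 0 in that state

print(Q[state])          # [0.25, 0.0]
print(Q[(20, 10, True)]) # unseen state -> [0.0, 0.0]
```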

Usage

Use tabular Q-Learning for environments with small, discrete state and action spaces (e.g., Blackjack, FrozenLake, Taxi). For continuous or high-dimensional states, use Deep Q-Networks (DQN) instead.

Theoretical Basis

The Q-Learning update rule:

  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a′ Q(s_{t+1}, a′) − Q(s_t, a_t) ]

Where:

  • α: Learning rate
  • γ: Discount factor
  • max_a′ Q(s′, a′): Off-policy bootstrap target
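
A worked instance of the update (the function name is illustrative):

```python
def td_update(q_sa, reward, max_q_next, alpha, gamma):
    """One Q-Learning step: Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)

# With Q(s,a)=0, r=1, gamma=0.9, max_a' Q(s',a')=2, alpha=0.1:
# target = 1 + 0.9*2 = 2.8, so Q(s,a) <- 0 + 0.1*(2.8 - 0) = 0.28
new_q = td_update(0.0, 1.0, 2.0, alpha=0.1, gamma=0.9)
```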

Convergence is guaranteed under the Robbins-Monro conditions: every state-action pair is visited infinitely often, and the learning rates satisfy Σ_t α_t = ∞ and Σ_t α_t² < ∞ (e.g., α_t = 1/t).

Epsilon-greedy policy:

  π(a|s) = 1 − ϵ + ϵ/|A|   if a = argmax_a′ Q(s, a′)
           ϵ/|A|            otherwise
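
The two cases above assign a total probability of 1 over the action set: each of the |A| actions gets ϵ/|A|, and the greedy action gets an extra 1 − ϵ. A quick sketch (function name is illustrative):

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|s) for each action under an epsilon-greedy policy."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n        # every action gets epsilon/|A|
    probs[greedy] += 1.0 - epsilon   # greedy action gets the extra 1 - epsilon
    return probs

probs = epsilon_greedy_probs([0.1, 0.5, 0.2], epsilon=0.3)
# greedy action is index 1: pi = 1 - 0.3 + 0.3/3 = 0.8; the others get 0.3/3 = 0.1
```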
