Principle:Online ml River Bandit Environments

Knowledge Sources	Bandit Algorithms Reinforcement Learning: An Introduction
Domains	Online_Learning Bandit_Algorithms Simulation
Last Updated	2026-02-08 18:00 GMT

Overview

Bandit simulation environments are synthetic or structured testbeds that generate reward signals in response to an agent's arm selections. They provide controlled, reproducible settings for developing, debugging, and comparing bandit policies with known ground-truth properties.

Description

While logged datasets capture real-world bandit problems, simulation environments offer complementary advantages for algorithm development:

Known optimal policy: The true reward distributions are defined programmatically, so the optimal policy and theoretical regret bounds are known.
Controlled difficulty: Parameters like the number of arms, reward gaps between arms, and noise levels can be adjusted systematically.
Unlimited interaction: The agent can interact with the environment for as many rounds as needed, unlike finite logged datasets.
Reproducibility: Given the same random seed, experiments produce identical results.

Classic bandit environments include:

K-armed testbed: Each arm has a reward distribution (e.g., Gaussian with different means). The standard benchmark from Sutton and Barto's reinforcement learning textbook.
Bernoulli bandits: Each arm produces binary rewards with a fixed but unknown probability.
Structured environments: Environments with specific reward structures designed to test particular algorithmic properties (e.g., many near-optimal arms, adversarial reward sequences, or non-stationary reward distributions).

Usage

Use bandit simulation environments when:

You are developing or debugging a new bandit policy.
You need to verify that an algorithm achieves its theoretical regret bound.
You want to systematically study how algorithm performance varies with problem parameters.
You need unlimited data for stress-testing policy behavior.

Theoretical Basis

K-armed testbed (Sutton & Barto): Define $K$ arms with true values:

q*(a) ~ N(0, 1)    for a = 1, ..., K
R_t | A_t = a  ~  N(q*(a), 1)

The optimal policy always selects $a^{*} = \arg \max_{a} q^{*} (a)$ , achieving expected reward $q^{*} (a^{*})$ per round.

Bernoulli bandit: Each arm $a$ has parameter $θ_{a} \in [0, 1]$ :

R_t | A_t = a  ~  Bernoulli(theta_a)

Gap parameter: The hardness of a bandit instance is characterized by the suboptimality gaps:

Delta_a = mu* - mu_a

Arms with smaller gaps are harder to distinguish from the optimal arm. Instance-dependent regret bounds scale as $\sum_{a} \frac{\ln T}{Δ_{a}}$ .

Non-stationary environments: Some environments change reward distributions over time, requiring policies with adaptive exploration. In restless bandits, all arm distributions drift simultaneously, while in switching bandits, the best arm changes at discrete change points.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment