Implementation:Online ml River Bandit Envs KArmedTestbed
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Multi_Armed_Bandits, Reinforcement_Learning, Simulation |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
A classic k-armed testbed environment for evaluating bandit algorithms, inspired by Sutton and Barto's Reinforcement Learning textbook.
Description
KArmedTestbed is a simple Gymnasium environment that implements the k-armed bandit problem. When the environment is reset, each arm's true reward is drawn from a standard normal distribution. When an arm is pulled, the reward is sampled from a normal distribution with unit variance, centered at that arm's true reward. An episode lasts 1000 steps by default (n_steps = 1000), and the number of arms k is configurable (default 10). Because the true rewards stay fixed throughout an episode, the problem is stationary, which makes it useful for benchmarking.
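The reward model described above can be sketched with the standard library alone. This is an illustrative stand-in, not River's implementation: the class name ToyKArmedTestbed and its private attributes are hypothetical, it does not subclass gym.Env, and it assumes the true rewards are drawn inside reset so that a seed can be supplied there.

```python
import random


class ToyKArmedTestbed:
    """Hypothetical stdlib-only stand-in; not River's KArmedTestbed."""

    def __init__(self, k: int = 10):
        self.k = k
        self._rng = random.Random()
        self._true_rewards = []

    def reset(self, seed=None):
        # Reseed, then draw each arm's true reward from N(0, 1).
        self._rng = random.Random(seed)
        self._true_rewards = [self._rng.gauss(0.0, 1.0) for _ in range(self.k)]
        # Return the index of the arm with the highest true reward.
        return max(range(self.k), key=self._true_rewards.__getitem__)

    def step(self, arm: int) -> float:
        # The observed reward is N(true reward of the pulled arm, 1).
        return self._rng.gauss(self._true_rewards[arm], 1.0)


env = ToyKArmedTestbed(k=10)
best_arm = env.reset(seed=42)

# Pulling the same arm many times should average out to its true reward.
rewards = [env.step(best_arm) for _ in range(5000)]
sample_mean = sum(rewards) / len(rewards)
print(f"best arm: {best_arm}, sample mean reward: {sample_mean:.2f}")
```

Because the true rewards never change after reset, the sample mean of repeated pulls converges to the arm's true reward, which is what makes the problem stationary.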
Usage
Use this environment for basic testing and evaluation of bandit algorithms. It's particularly useful for reproducing results from the reinforcement learning literature and for educational purposes. The stationary nature makes it ideal for comparing algorithm performance in controlled conditions.
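As a rough illustration of comparing algorithms under controlled conditions, the sketch below runs an epsilon-greedy policy and a uniformly random policy on the same simulated problems by reusing seeds. It is a standard-library simulation of the reward model, not River's API: run_epsilon_greedy and all of its parameters are hypothetical names chosen for this example.

```python
import random


def run_epsilon_greedy(epsilon: float, seed: int, k: int = 10, n_steps: int = 1000) -> float:
    """Average reward of epsilon-greedy on one simulated k-armed problem."""
    rng = random.Random(seed)
    # The true rewards are drawn first, so a given seed always yields the
    # same problem regardless of the epsilon value used afterwards.
    true_rewards = [rng.gauss(0.0, 1.0) for _ in range(k)]
    counts = [0] * k
    estimates = [0.0] * k
    total = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: pick an arm uniformly at random
        else:
            arm = max(range(k), key=estimates.__getitem__)  # exploit
        reward = rng.gauss(true_rewards[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the sample-average reward estimate.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / n_steps


seeds = range(50)
eps_greedy_avg = sum(run_epsilon_greedy(0.1, s) for s in seeds) / len(seeds)
uniform_avg = sum(run_epsilon_greedy(1.0, s) for s in seeds) / len(seeds)
print(f"epsilon=0.1: {eps_greedy_avg:.3f}  uniform random: {uniform_avg:.3f}")
```

Averaging over several seeded problems, rather than reporting a single run, keeps the comparison from hinging on one lucky draw of arm rewards.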
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/bandit/envs/testbed.py
Signature
class KArmedTestbed(gym.Env):
    n_steps = 1000

    def __init__(self, k: int = 10):
        ...

    def reset(self, seed=None, options=None):
        ...

    def step(self, arm):
        ...
Import
import gymnasium as gym
from river import bandit  # importing river.bandit registers the river_bandits environments
env = gym.make('river_bandits/KArmedTestbed-v0')
I/O Contract
| Parameter/Method | Type | Description |
|---|---|---|
| k | int (default: 10) | Number of arms |
| action_space | Discrete(k) | Action space (arm indices) |
| observation_space | Discrete(k) | Observation space; each observation is the index of the best arm (not typically used) |
| reward_range | (-inf, inf) | Unbounded reward range |
Usage Examples
import gymnasium as gym
from river import bandit
from river import stats

# Create environment with 10 arms
env = gym.make('river_bandits/KArmedTestbed-v0', k=10)
_ = env.reset(seed=42)

# Test a bandit policy
policy = bandit.UCB(delta=1, seed=42)
metric = stats.Mean()

for _ in range(1000):
    arm = policy.pull(range(env.action_space.n))
    observation, reward, terminated, truncated, info = env.step(arm)
    policy.update(arm, reward)
    metric.update(reward)
    if terminated or truncated:
        break

print(metric)  # Average reward