Implementation:Online ml River Bandit Envs KArmedTestbed

Knowledge Sources	Online_ml_River
Domains	Online_Learning, Multi_Armed_Bandits, Reinforcement_Learning, Simulation
Last Updated	2026-02-08 16:00 GMT

Overview

A classic k-armed testbed environment for evaluating bandit algorithms, modeled on the 10-armed testbed in Sutton and Barto's textbook Reinforcement Learning: An Introduction.

Description

KArmedTestbed is a simple Gymnasium environment that implements the k-armed bandit problem. When the environment is reset, each arm's true mean reward is drawn from a standard normal distribution. Pulling an arm returns a reward sampled from a normal distribution centered at that arm's true mean, with unit variance. The number of arms k is configurable (default 10), and the class attribute n_steps sets the episode length to 1000 steps. Because the true means stay fixed for the whole episode, the problem is stationary, which makes it a convenient benchmark.
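The reward model described above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than River's implementation: the class name ToyKArmedTestbed and its private attributes are hypothetical, and the sketch assumes the true means are drawn in reset so that the seed controls them.

```python
import random


class ToyKArmedTestbed:
    """Hypothetical stand-in that mirrors the reward model described above."""

    def __init__(self, k: int = 10):
        self.k = k
        self._rng = random.Random()
        self._true_means = []
        self._best_arm = 0

    def reset(self, seed=None):
        # Assumption of this sketch: the seed passed to reset drives all draws.
        self._rng = random.Random(seed)
        # Each arm's true mean reward is drawn from a standard normal N(0, 1).
        self._true_means = [self._rng.gauss(0.0, 1.0) for _ in range(self.k)]
        self._best_arm = max(range(self.k), key=lambda a: self._true_means[a])
        # The observation is the index of the best arm; info is empty.
        return self._best_arm, {}

    def step(self, arm: int):
        # The reward is normal with unit variance, centered at the pulled arm's true mean.
        reward = self._rng.gauss(self._true_means[arm], 1.0)
        # Same 5-tuple layout as a Gymnasium step; this toy never ends an episode itself.
        return self._best_arm, reward, False, False, {}


env = ToyKArmedTestbed(k=10)
best_arm, info = env.reset(seed=42)

# Pulling the best arm many times should give a sample mean close to its true mean.
rewards = [env.step(best_arm)[1] for _ in range(5000)]
sample_mean = sum(rewards) / len(rewards)
print(best_arm, round(sample_mean, 2))
```

Because the true means never change after reset, the sample mean of any single arm converges to that arm's true mean, which is what makes the problem stationary.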

Usage

Use this environment for basic testing and evaluation of bandit algorithms. It's particularly useful for reproducing results from the reinforcement learning literature and for educational purposes. The stationary nature makes it ideal for comparing algorithm performance in controlled conditions.
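As an illustration of a controlled comparison, the sketch below runs epsilon-greedy with several exploration rates on freshly drawn testbeds and reports the average reward per step, in the spirit of the textbook experiment. It deliberately avoids River and Gymnasium so that it runs on its own. The function run_epsilon_greedy and its parameters are hypothetical, and the run counts are kept small for speed, so the printed numbers are only indicative.

```python
import random


def run_epsilon_greedy(epsilon: float, k: int = 10, n_steps: int = 500,
                       n_runs: int = 200, seed: int = 0) -> float:
    """Average per-step reward of epsilon-greedy over n_runs random testbeds."""
    # Separate generators so every epsilon is evaluated on the same set of problems.
    problem_rng = random.Random(seed)
    play_rng = random.Random(seed + 1)
    total_reward = 0.0
    for _ in range(n_runs):
        # A new stationary problem per run: true means drawn from N(0, 1).
        true_means = [problem_rng.gauss(0.0, 1.0) for _ in range(k)]
        estimates = [0.0] * k  # sample-average reward estimate per arm
        counts = [0] * k       # number of pulls per arm
        for _ in range(n_steps):
            if play_rng.random() < epsilon:
                arm = play_rng.randrange(k)  # explore: uniform random arm
            else:
                arm = max(range(k), key=lambda a: estimates[a])  # exploit
            reward = play_rng.gauss(true_means[arm], 1.0)
            counts[arm] += 1
            # Incremental update of the sample average for the pulled arm.
            estimates[arm] += (reward - estimates[arm]) / counts[arm]
            total_reward += reward
    return total_reward / (n_runs * n_steps)


results = {eps: run_epsilon_greedy(eps) for eps in (0.0, 0.1, 1.0)}
for eps, avg in results.items():
    print(f"epsilon={eps}: average reward {avg:.3f}")
```

With epsilon set to 1.0 the policy pulls arms uniformly at random, so its average reward stays near zero (the mean of the arm distribution). That gives a baseline against which the other settings can be read.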

Code Reference

Source Location

Repository: Online_ml_River
File: river/bandit/envs/testbed.py

Signature

class KArmedTestbed(gym.Env):
    n_steps = 1000

    def __init__(self, k: int = 10):
        ...

    def reset(self, seed=None, options=None):
        ...

    def step(self, arm):
        ...

Import

import gymnasium as gym
from river import bandit  # importing river.bandit makes the river_bandits environments available

env = gym.make('river_bandits/KArmedTestbed-v0')

I/O Contract

Parameter/Attribute	Type	Description
k	int (default: 10)	Number of arms
action_space	Discrete(k)	Action space (arm indices)
observation_space	Discrete(k)	Best arm index (not typically used)
reward_range	(-inf, inf)	Unbounded reward range

Usage Examples

import gymnasium as gym
from river import bandit
from river import stats

# Create environment with 10 arms
env = gym.make('river_bandits/KArmedTestbed-v0', k=10)
_ = env.reset(seed=42)

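# Policy under test: upper confidence bound (UCB)
policy = bandit.UCB(delta=1, seed=42)

# Track the running mean of the observed rewards
metric = stats.Mean()
for _ in range(1000):
    # Ask the policy which arm to pull, then play it in the environment
    arm = policy.pull(range(env.action_space.n))
    observation, reward, terminated, truncated, info = env.step(arm)
    # Feed the observed reward back to the policy
    policy.update(arm, reward)
    metric.update(reward)
    # Stop early if the environment signals the end of the episode
    if terminated or truncated:
        break

print(metric)  # Average reward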
