Implementation:Online ml River Bandit Envs KArmedTestbed

From Leeroopedia


Knowledge Sources
Domains: Online_Learning, Multi_Armed_Bandits, Reinforcement_Learning, Simulation
Last Updated: 2026-02-08 16:00 GMT

Overview

A classic k-armed testbed environment for evaluating bandit algorithms, inspired by Sutton and Barto's Reinforcement Learning textbook.

Description

KArmedTestbed is a simple Gymnasium environment that implements the k-armed bandit problem. At initialization, each arm's true reward is drawn from a standard normal distribution. Pulling an arm returns a reward sampled from a normal distribution centered on that arm's true reward, with unit variance. Episodes run for 1000 steps by default (the class attribute n_steps), and the number of arms k is configurable (default 10). Because the true rewards do not change between pulls, the problem is stationary, which makes it useful for benchmarking.
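The reward mechanism described above can be sketched with the standard library alone. This is an illustrative stand-in, not River's source: the class name SketchTestbed, its pull method, and its attributes are hypothetical and exist only for this example.

```python
import random


class SketchTestbed:
    """Hypothetical stand-in for the dynamics described above (not River's code)."""

    def __init__(self, k: int = 10, seed=None):
        self.k = k
        self.rng = random.Random(seed)
        # Each arm's true reward is drawn once from a standard normal, N(0, 1).
        self.true_rewards = [self.rng.gauss(0.0, 1.0) for _ in range(k)]
        # Index of the arm with the highest true reward.
        self.best_arm = max(range(k), key=lambda a: self.true_rewards[a])

    def pull(self, arm: int) -> float:
        # The observed reward is normal, centered on the arm's true reward,
        # with unit variance.
        return self.rng.gauss(self.true_rewards[arm], 1.0)


env = SketchTestbed(k=10, seed=42)

# Averaging many pulls of one arm should recover that arm's true reward.
samples = [env.pull(env.best_arm) for _ in range(5000)]
sample_mean = sum(samples) / len(samples)
print(round(env.true_rewards[env.best_arm], 3), round(sample_mean, 3))
```

The two printed numbers should be close to each other, since the sample mean of 5000 unit-variance draws has a standard error of roughly 0.014.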

Usage

Use this environment for basic testing and evaluation of bandit algorithms. It is particularly useful for reproducing results from the reinforcement learning literature and for educational purposes. Because the environment is stationary, differences in reward between runs reflect the algorithms being compared rather than drift in the environment, which makes it well suited to controlled comparisons.
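As a rough illustration of a controlled comparison, the sketch below runs a hand-written epsilon-greedy agent at three exploration rates against the same fixed arm means. It uses only the standard library and simulates the testbed's reward rule directly. The function run_epsilon_greedy and all of its parameters are assumptions made for this example and are not part of River or Gymnasium.

```python
import random


def run_epsilon_greedy(true_rewards, epsilon, n_steps=1000, seed=0):
    """Average reward of an epsilon-greedy agent on a stationary testbed."""
    rng = random.Random(seed)
    k = len(true_rewards)
    counts = [0] * k        # number of pulls per arm
    estimates = [0.0] * k   # running mean reward per arm
    total = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: uniform random arm
        else:
            # Exploit: arm with the highest estimate (first one on ties).
            arm = max(range(k), key=lambda a: estimates[a])
        # Reward ~ N(true reward of the arm, 1), mirroring the testbed.
        reward = rng.gauss(true_rewards[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the running mean.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / n_steps


# The same arm means are reused for every agent, so the comparison is controlled.
arm_rng = random.Random(42)
true_rewards = [arm_rng.gauss(0.0, 1.0) for _ in range(10)]

for eps in (0.0, 0.1, 1.0):
    avg = run_epsilon_greedy(true_rewards, epsilon=eps, seed=1)
    print(f"epsilon={eps}: average reward {avg:.3f}")
```

Setting epsilon to 1.0 gives a uniformly random agent, which serves as a baseline. A single run per setting is noisy, so averaging over several seeds would give a fairer picture.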

Code Reference

Source Location

Signature

class KArmedTestbed(gym.Env):
    n_steps = 1000

    def __init__(self, k: int = 10):
        ...

    def reset(self, seed=None, options=None):
        ...

    def step(self, arm):
        ...

Import

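import gymnasium as gym
# Importing river.bandit is expected to register the 'river_bandits/...'
# environment IDs with Gymnasium; without it, gym.make may not find the ID.
from river import bandit

env = gym.make('river_bandits/KArmedTestbed-v0')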

I/O Contract

Parameter/Method | Type | Description
k | int (default: 10) | Number of arms
action_space | Discrete(k) | Action space (arm indices)
observation_space | Discrete(k) | Best arm index (not typically used)
reward_range | (-inf, inf) | Unbounded reward range

Usage Examples

import gymnasium as gym
from river import bandit
from river import stats

# Create environment with 10 arms
env = gym.make('river_bandits/KArmedTestbed-v0', k=10)
_ = env.reset(seed=42)

# Test a bandit policy
policy = bandit.UCB(delta=1, seed=42)

metric = stats.Mean()
for _ in range(1000):
    arm = policy.pull(range(env.action_space.n))
    observation, reward, terminated, truncated, info = env.step(arm)
    policy.update(arm, reward)
    metric.update(reward)
    if terminated or truncated:
        break

print(metric)  # Average reward
