Implementation:Online ml River Bandit UCB


Knowledge Sources
Domains: Online_Learning, Multi_Armed_Bandits, Upper_Confidence_Bounds
Last Updated: 2026-02-08 16:00 GMT

Overview

Upper Confidence Bound algorithm that selects arms based on optimistic estimates combining mean reward and an exploration bonus.

Description

UCB (Upper Confidence Bound) computes an upper bound on the expected reward of each arm by adding to the arm's empirical mean a confidence term proportional to sqrt(log(n) / n_arm), where n is the total number of pulls so far and n_arm is the number of times that arm has been pulled. The delta parameter controls the width of the confidence interval; setting delta=1 yields the classic UCB1 algorithm. The policy selects the arm with the highest upper bound, which balances exploration (the uncertainty bonus shrinks as an arm is pulled more often) against exploitation (the mean reward). UCB-style algorithms come with logarithmic regret bounds. Scaling rewards so that they are approximately sub-gaussian is recommended for best performance.
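The bound described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not River's source: the helper name `upper_bound`, the arm statistics, and the constant 2 inside the square root are assumptions chosen to match the common UCB1 form, and treating an unpulled arm as infinitely optimistic is likewise an assumed convention.

```python
import math


def upper_bound(mean_reward: float, n_arm: int, n_total: int, delta: float) -> float:
    """Empirical mean plus an exploration bonus (assumed UCB1-style form)."""
    if n_arm == 0:
        # Assumption: an arm that was never pulled is maximally optimistic.
        return math.inf
    bonus = delta * math.sqrt(2 * math.log(n_total) / n_arm)
    return mean_reward + bonus


# Hypothetical statistics: arm -> (empirical mean, number of pulls)
stats_by_arm = {"a": (0.60, 50), "b": (0.55, 5), "c": (0.0, 0)}
n_total = sum(n for _, n in stats_by_arm.values())

bounds = {
    arm: upper_bound(mean, n, n_total, delta=1.0)
    for arm, (mean, n) in stats_by_arm.items()
}
chosen = max(bounds, key=bounds.get)
```

In this toy data, arm "b" has a lower mean than arm "a" but a larger bound because it has been pulled far less often, and the never-pulled arm "c" is chosen first.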

Usage

Use UCB when you want an algorithm with theoretical guarantees and deterministic arm selection: given the same reward statistics, the chosen arm is fixed, and randomness enters only when ties are broken (controlled by seed). It is particularly effective when rewards are bounded or scaled to [0, 1]. For unbounded rewards, pass a preprocessing.TargetStandardScaler as reward_scaler so that the scaled rewards are closer to sub-gaussian.
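To illustrate what standardising rewards means in an online setting, here is a minimal sketch that keeps a running mean and variance (Welford's method) and rescales each reward. The class `RunningStandardizer` and the reward values are hypothetical; River's TargetStandardScaler may differ in its details.

```python
import math


class RunningStandardizer:
    """Hypothetical helper: standardise rewards with a running mean and variance."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's single-pass update
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self._m2 += d * (x - self.mean)

    def scale(self, x: float) -> float:
        if self.n < 2:
            return 0.0  # not enough data to estimate a spread
        std = math.sqrt(self._m2 / (self.n - 1))
        return (x - self.mean) / std if std > 0 else 0.0


scaler = RunningStandardizer()
raw_rewards = [120.0, 80.0, 100.0, 140.0, 60.0]  # made-up unbounded rewards
for r in raw_rewards:
    scaler.update(r)
scaled = [scaler.scale(r) for r in raw_rewards]
```

After scaling, the rewards are centred on zero with unit sample variance, so a fixed delta produces a bonus on a comparable scale whatever the raw reward units are.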

Code Reference

Source Location

Signature

class UCB(bandit.base.Policy):
    def __init__(
        self,
        delta: float,
        reward_obj=None,
        reward_scaler=None,
        burn_in=0,
        seed: int | None = None,
    ):
        ...

Import

from river import bandit

I/O Contract

Parameter | Type | Description
delta | float | Confidence level (delta=1 gives UCB1)
reward_obj | RewardObj (optional) | Reward statistic (defaults to stats.Mean())
reward_scaler | TargetTransformRegressor (optional) | Scales rewards for sub-gaussian property
burn_in | int (default: 0) | Minimum pulls per arm
seed | int (optional) | Random seed for tie-breaking
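The roles of burn_in and seed can be sketched as follows. This is a hedged illustration rather than River's code: the function `pull`, its arguments, the rule of pulling any arm with fewer than burn_in pulls first, and the constant in the bonus are all assumptions.

```python
import math
import random


def pull(counts, means, n_total, delta, burn_in=0, rng=None):
    """Pick an arm: honour burn-in first, then take the highest upper bound."""
    rng = rng or random.Random()
    # Assumed burn-in rule: an arm pulled fewer than `burn_in` times goes first.
    for arm, n in counts.items():
        if n < burn_in:
            return arm
    bounds = {
        arm: math.inf if n == 0
        else means[arm] + delta * math.sqrt(2 * math.log(n_total) / n)
        for arm, n in counts.items()
    }
    best = max(bounds.values())
    candidates = [arm for arm, b in bounds.items() if b == best]
    # Ties are the only source of randomness; a seeded RNG makes them repeatable.
    return rng.choice(candidates) if len(candidates) > 1 else candidates[0]


# Arm "b" has not reached the burn-in threshold, so it is pulled first.
burn_in_choice = pull({"a": 3, "b": 1, "c": 3}, {"a": 0.9, "b": 0.1, "c": 0.5},
                      n_total=7, delta=1.0, burn_in=2)

# Two identical arms tie; the same seed yields the same choice.
tie_counts, tie_means = {"a": 4, "b": 4}, {"a": 0.5, "b": 0.5}
first = pull(tie_counts, tie_means, 8, 1.0, rng=random.Random(42))
second = pull(tie_counts, tie_means, 8, 1.0, rng=random.Random(42))
```

With no ties and burn-in satisfied, the choice depends only on the statistics, which is why the seed matters only for tie-breaking.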

Usage Examples

import gymnasium as gym
from river import bandit
from river import preprocessing
from river import stats

env = gym.make('river_bandits/CandyCaneContest-v0')
_ = env.reset(seed=42)
_ = env.action_space.seed(123)

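# UCB with reward scaling. TargetStandardScaler normally wraps a regressor;
# None is passed here because only the scaling is needed.
policy = bandit.UCB(
    delta=100,
    reward_scaler=preprocessing.TargetStandardScaler(None),
    seed=42
)

# Track the cumulative reward over one episode
metric = stats.Sum()
while True:
    # Choose among all arms, play the chosen arm, then feed the reward back
    arm = policy.pull(range(env.action_space.n))
    observation, reward, terminated, truncated, info = env.step(arm)
    policy.update(arm, reward)
    metric.update(reward)
    if terminated or truncated:
        break

print(metric)  # Sum: 744.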

# Classic UCB1
policy_ucb1 = bandit.UCB(delta=1, seed=42)
