Implementation:Online ml River Bandit UCB
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Multi_Armed_Bandits, Upper_Confidence_Bounds |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Upper Confidence Bound algorithm that selects arms based on optimistic estimates combining mean reward and an exploration bonus.
Description
UCB (Upper Confidence Bound) computes an upper bound on the expected reward of each arm by adding a confidence term to the arm's empirical mean. The confidence term is proportional to sqrt(log(n) / n_arm), where n is the total number of pulls so far and n_arm is the number of times that arm has been pulled, so rarely pulled arms receive a larger bonus. The delta parameter controls the width of the confidence interval; setting delta=1 yields the classic UCB1 algorithm. The policy selects the arm with the highest upper bound, which balances exploration (the uncertainty bonus) against exploitation (the mean reward). The algorithm has strong theoretical guarantees: its regret grows logarithmically with the number of pulls. Scaling rewards so that they exhibit sub-Gaussian properties is recommended for best performance.
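The score described above can be sketched in plain Python. This is a minimal illustration, not River's implementation: the function name `ucb_scores`, the example statistics, and the absence of any constant factor inside the square root are all assumptions made for the sketch.

```python
import math


def ucb_scores(means, counts, delta=1.0):
    """Return an optimistic score per arm: mean + delta * sqrt(log(n) / n_arm).

    Hypothetical helper; the exact constant inside the square root varies
    between implementations and is omitted here.
    """
    n = sum(counts.values())  # total number of pulls so far
    scores = {}
    for arm, mean in means.items():
        n_arm = counts[arm]
        if n_arm == 0:
            # An arm that was never pulled has unbounded uncertainty.
            scores[arm] = math.inf
        else:
            scores[arm] = mean + delta * math.sqrt(math.log(n) / n_arm)
    return scores


# Made-up statistics: "a" is well explored, "b" barely, "c" not at all.
means = {"a": 0.60, "b": 0.50, "c": 0.0}
counts = {"a": 50, "b": 5, "c": 0}

scores = ucb_scores(means, counts)
best = max(scores, key=scores.get)
```

In this example the untried arm "c" gets an infinite score and is selected first, and arm "b" outranks arm "a" despite its lower mean because it has been pulled far fewer times.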
Usage
Use UCB when you want an algorithm with theoretical guarantees and arm selection that is deterministic given the reward statistics; randomness enters only through seeded tie-breaking. It is particularly effective when rewards are bounded or scaled to [0, 1]. For unbounded rewards, pass a TargetStandardScaler as reward_scaler so that the scaled rewards are closer to sub-Gaussian.
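To make the idea of reward scaling concrete, the sketch below standardizes rewards online using a running mean and variance (Welford's method). The class `RunningStandardizer` is hypothetical and only illustrates the general technique; River's TargetStandardScaler may differ in details such as how the variance is estimated.

```python
import math


class RunningStandardizer:
    """Hypothetical helper: standardize rewards with a running mean and variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's single-pass update of the mean and squared deviations.
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def transform(self, x):
        # With fewer than two observations the variance is undefined.
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))  # sample standard deviation
        return (x - self.mean) / std if std > 0 else 0.0


scaler = RunningStandardizer()
for reward in [120.0, 80.0, 100.0, 140.0, 60.0]:  # made-up unbounded rewards
    scaler.update(reward)

scaled = scaler.transform(140.0)
```

Standardized rewards are centered near zero with unit spread, which keeps the mean term and the exploration bonus on comparable scales regardless of the raw reward units.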
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/bandit/ucb.py
Signature
class UCB(bandit.base.Policy):
    def __init__(
        self,
        delta: float,
        reward_obj=None,
        reward_scaler=None,
        burn_in=0,
        seed: int | None = None,
    ):
        ...
Import
from river import bandit
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| delta | float | Confidence level (delta=1 gives UCB1) |
| reward_obj | RewardObj (optional) | Reward statistic (defaults to stats.Mean()) |
| reward_scaler | TargetTransformRegressor (optional) | Scales rewards for sub-gaussian property |
| burn_in | int (default: 0) | Minimum pulls per arm |
| seed | int (optional) | Random seed for tie-breaking |
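The roles of burn_in and seed can be illustrated with a small simulation. This is a sketch under stated assumptions, not River's code: `select_arm` is a hypothetical function, burn-in is assumed to mean that any arm with fewer than burn_in pulls is chosen before scores are compared, and the Bernoulli success rates are invented for the example.

```python
import math
import random


def select_arm(arms, means, counts, delta, burn_in, rng):
    # Burn-in (assumed semantics): arms with too few pulls are served first.
    for arm in arms:
        if counts[arm] < burn_in:
            return arm
    n = sum(counts[arm] for arm in arms)
    scores = {
        arm: means[arm] + delta * math.sqrt(math.log(n) / counts[arm])
        if counts[arm]
        else math.inf
        for arm in arms
    }
    best = max(scores.values())
    candidates = [arm for arm in arms if scores[arm] == best]
    # The seeded generator is only needed to break ties between equal scores.
    return rng.choice(candidates)


rng = random.Random(42)
arms = [0, 1, 2]
true_p = [0.2, 0.5, 0.8]  # hypothetical Bernoulli success rates
means = {arm: 0.0 for arm in arms}
counts = {arm: 0 for arm in arms}

for _ in range(2000):
    arm = select_arm(arms, means, counts, delta=1.0, burn_in=5, rng=rng)
    reward = 1.0 if rng.random() < true_p[arm] else 0.0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
```

Fixing the seed makes the run reproducible, and the burn-in guarantees that every arm has at least five observations before its mean estimate influences selection.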
Usage Examples
import gymnasium as gym
from river import bandit
from river import preprocessing
from river import stats

env = gym.make('river_bandits/CandyCaneContest-v0')
_ = env.reset(seed=42)
_ = env.action_space.seed(123)

# Use with reward scaling
policy = bandit.UCB(
    delta=100,
    reward_scaler=preprocessing.TargetStandardScaler(None),
    seed=42
)

# Track the cumulative reward over the episode
metric = stats.Sum()

while True:
    # Choose an arm, observe its reward, and update the policy
    arm = policy.pull(range(env.action_space.n))
    observation, reward, terminated, truncated, info = env.step(arm)
    policy.update(arm, reward)
    metric.update(reward)
    if terminated or truncated:
        break

print(metric)  # Sum: 744.

# Classic UCB1
policy_ucb1 = bandit.UCB(delta=1, seed=42)