Implementation:Online ml River Bandit Exp3

From Leeroopedia


Knowledge Sources
Domains Online_Learning, Multi_Armed_Bandits, Adversarial_Bandits
Last Updated 2026-02-08 16:00 GMT

Overview

An adversarial bandit algorithm that maintains exponential weights for each arm and supports tunable egalitarianism through the gamma parameter.

Description

Exp3 (Exponential-weight algorithm for Exploration and Exploitation) is designed for non-stochastic (adversarial) bandit problems. It maintains one weight per arm and turns these weights into a probability distribution from which the next arm is sampled. After a reward is observed, the pulled arm's weight is updated multiplicatively using the reward divided by that arm's selection probability (importance weighting), which compensates for arms that are rarely pulled. The gamma parameter controls egalitarianism: gamma=0 gives pure Exp3, while gamma=1 results in uniform random selection. Exp3's regret guarantees do not require rewards to be drawn from fixed distributions, so they continue to hold in adversarial settings.
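The select-then-update cycle described above can be sketched in plain Python. This is a minimal illustration of the textbook Exp3 formulation, assuming rewards in [0, 1]; it is not taken from River's source, and the helper names exp3_probabilities and exp3_update are hypothetical.

```python
import math
import random


def exp3_probabilities(weights, gamma):
    """Mix the weight-proportional distribution with the uniform one."""
    total = sum(weights)
    n_arms = len(weights)
    return [(1 - gamma) * w / total + gamma / n_arms for w in weights]


def exp3_update(weights, probabilities, arm, reward, gamma):
    """Return new weights after observing `reward` (assumed in [0, 1]) for `arm`."""
    n_arms = len(weights)
    # Importance weighting: divide by the probability of having pulled this arm.
    estimated_reward = reward / probabilities[arm]
    new_weights = list(weights)
    # Multiplicative update; only the pulled arm's weight changes.
    new_weights[arm] *= math.exp(gamma * estimated_reward / n_arms)
    return new_weights


weights = [1.0, 1.0, 1.0]
probs = exp3_probabilities(weights, gamma=0.5)

rng = random.Random(0)
arm = rng.choices(range(len(weights)), weights=probs)[0]

weights = exp3_update(weights, probs, arm, reward=1.0, gamma=0.5)
```

Returning a new list rather than mutating in place keeps the sketch easy to test; a real policy would typically store the weights and the last probabilities as internal state.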

Usage

Use Exp3 when rewards may be chosen by an adversary or may otherwise change over time, so that you cannot assume they are drawn from fixed distributions. Set gamma according to how much exploration you want: higher values spread pulls more uniformly across the arms.
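To get a feel for how gamma trades exploitation for uniform exploration, the sketch below evaluates the selection probabilities for a fixed set of weights at three gamma values. It assumes the textbook mixing rule (a gamma-weighted blend of the weight-proportional and uniform distributions); the function name mixed_probabilities and the example weights are illustrative, not part of River's API.

```python
def mixed_probabilities(weights, gamma):
    """Blend weight-proportional selection with uniform selection."""
    total = sum(weights)
    n_arms = len(weights)
    return [(1 - gamma) * w / total + gamma / n_arms for w in weights]


# One arm has accumulated far more weight than the other two.
weights = [8.0, 1.0, 1.0]

for gamma in (0.0, 0.5, 1.0):
    probs = mixed_probabilities(weights, gamma)
    print(gamma, [round(p, 3) for p in probs])

# 0.0 [0.8, 0.1, 0.1]
# 0.5 [0.567, 0.217, 0.217]
# 1.0 [0.333, 0.333, 0.333]
```

At gamma=0 the heavy arm is pulled 80% of the time, while at gamma=1 the weights are ignored entirely and every arm is equally likely.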

Code Reference

Source Location

Signature

class Exp3(bandit.base.Policy):
    def __init__(
        self,
        gamma: float,
        reward_obj=None,
        reward_scaler=None,
        burn_in=0,
        seed: int | None = None,
    ):
        ...

Import

from river import bandit

I/O Contract

Parameter | Type | Description
gamma | float | Egalitarianism factor in [0, 1]
reward_obj | RewardObj (optional) | Reward statistic (defaults to stats.Mean())
reward_scaler | TargetTransformRegressor (optional) | Scales rewards before updating
burn_in | int (default: 0) | Minimum pulls per arm
seed | int (optional) | Random seed for reproducibility

Usage Examples

import gymnasium as gym
from river import bandit
from river import stats

env = gym.make('river_bandits/CandyCaneContest-v0')
_ = env.reset(seed=42)
_ = env.action_space.seed(123)

# Moderate exploration
policy = bandit.Exp3(gamma=0.5, seed=42)

metric = stats.Sum()
while True:
    action = policy.pull(range(env.action_space.n))
    observation, reward, terminated, truncated, info = env.step(action)
    policy.update(action, reward)
    metric.update(reward)
    if terminated or truncated:
        break

print(metric)  # Sum: 799.

# Inspect the current arm weights (_weights is a private attribute, so it may change between River versions)
print(policy._weights)
