Implementation:Online ml River Bandit Exp3
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Multi_Armed_Bandits, Adversarial_Bandits |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
An adversarial bandit algorithm that maintains exponential weights for each arm and supports tunable egalitarianism through the gamma parameter.
Description
Exp3 (Exponential-weight algorithm for Exploration and Exploitation) is designed for non-stochastic (adversarial) bandit problems. It maintains one weight per arm and turns these weights into a probability distribution from which the next arm is sampled. After a reward is observed, the pulled arm's weight is updated multiplicatively, using the reward divided by that arm's selection probability (importance weighting). The division compensates for the fact that only one arm's reward is observed per round. The gamma parameter controls egalitarianism: gamma=0 gives pure Exp3, while gamma=1 results in uniform random selection. Exp3 comes with regret guarantees that hold without any stochastic assumption on the rewards, including when they are chosen adversarially.
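The weight-to-probability step and the importance-weighted update described above can be sketched in plain Python. This is a minimal illustration of the commonly cited textbook form of Exp3, not River's source code: the function names, the Bernoulli arms, and the use of gamma as the learning rate inside the exponent are assumptions made for the example, and rewards are assumed to lie in [0, 1].

```python
import math
import random


def exp3_probabilities(weights, gamma):
    """Blend the normalised weights with the uniform distribution."""
    total = sum(weights)
    n_arms = len(weights)
    return [(1 - gamma) * w / total + gamma / n_arms for w in weights]


def exp3_update(weights, probabilities, arm, reward, gamma):
    """Update the pulled arm's weight in place; reward is assumed to lie in [0, 1]."""
    estimated_reward = reward / probabilities[arm]  # importance weighting
    weights[arm] *= math.exp(gamma * estimated_reward / len(weights))


def run_exp3(payout_chances, gamma, n_rounds, seed):
    """Play hypothetical Bernoulli arms and return the final weights."""
    rng = random.Random(seed)
    weights = [1.0] * len(payout_chances)
    for _ in range(n_rounds):
        probabilities = exp3_probabilities(weights, gamma)
        arm = rng.choices(range(len(weights)), weights=probabilities)[0]
        reward = 1.0 if rng.random() < payout_chances[arm] else 0.0
        exp3_update(weights, probabilities, arm, reward, gamma)
    return weights


# Three made-up arms; the last one pays out most often.
final_weights = run_exp3([0.2, 0.5, 0.8], gamma=0.1, n_rounds=500, seed=0)
print(exp3_probabilities(final_weights, gamma=0.1))
```

In this textbook form gamma also scales the exponent, so gamma=0 would leave every weight unchanged. Whether River's implementation couples the two roles in the same way is not confirmed here.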
Usage
Use Exp3 when rewards may be chosen by an adversary or may otherwise be non-stationary, that is, when you cannot assume that each arm's rewards are drawn from a fixed distribution. Set gamma according to the desired level of exploration: higher values push the selection probabilities toward uniform.
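To get a feel for how gamma trades exploitation for uniform exploration, the sketch below prints the selection probabilities produced by a fixed set of weights under several gamma values. It assumes the usual mixing rule, (1 - gamma) times the normalised weight plus gamma divided by the number of arms. The helper name and the weights are made up for illustration and are not part of River's API.

```python
def mix_with_uniform(weights, gamma):
    """Return selection probabilities: normalised weights blended with uniform."""
    total = sum(weights)
    n_arms = len(weights)
    return [(1 - gamma) * w / total + gamma / n_arms for w in weights]


# Hypothetical weights after some learning: the third arm looks best so far.
weights = [1.0, 4.0, 15.0]

for gamma in (0.0, 0.1, 0.5, 1.0):
    probabilities = mix_with_uniform(weights, gamma)
    # Under this rule every arm keeps at least gamma / n_arms probability mass.
    print(gamma, [round(p, 3) for p in probabilities])
```

Under this rule no arm's probability drops below gamma divided by the number of arms, so a larger gamma keeps seemingly poor arms in play for longer, which matters when an adversary can change which arm is best.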
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/bandit/exp3.py
Signature
class Exp3(bandit.base.Policy):
    def __init__(
        self,
        gamma: float,
        reward_obj=None,
        reward_scaler=None,
        burn_in=0,
        seed: int | None = None,
    ):
        ...
Import
from river import bandit
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| gamma | float | Egalitarianism factor in [0, 1] |
| reward_obj | RewardObj (optional) | Reward statistic (defaults to stats.Mean()) |
| reward_scaler | TargetTransformRegressor (optional) | Scales rewards before updating |
| burn_in | int (default: 0) | Minimum pulls per arm |
| seed | int (optional) | Random seed for reproducibility |
Usage Examples
import gymnasium as gym
from river import bandit
from river import stats

# Create the bandit environment and seed it for reproducibility
env = gym.make('river_bandits/CandyCaneContest-v0')
_ = env.reset(seed=42)
_ = env.action_space.seed(123)

# Moderate exploration
policy = bandit.Exp3(gamma=0.5, seed=42)
metric = stats.Sum()

while True:
    # Choose an arm, play it, then feed the observed reward back to the policy
    action = policy.pull(range(env.action_space.n))
    observation, reward, terminated, truncated, info = env.step(action)
    policy.update(action, reward)
    metric.update(reward)
    if terminated or truncated:
        break

print(metric)  # Sum: 799.

# Check current weights (_weights is a private attribute and may change between versions)
print(policy._weights)