Implementation:Online ml River Bandit EpsilonGreedy
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Multi_Armed_Bandits, Exploration_Exploitation |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
A simple yet effective bandit policy that balances exploration and exploitation using epsilon-greedy action selection with optional decay.
Description
EpsilonGreedy selects the arm with the best estimated reward with probability (1 - epsilon) and explores by choosing an arm uniformly at random with probability epsilon. An optional decay parameter reduces epsilon exponentially over time, so the policy explores more early on and exploits more later. The current epsilon value is computed as epsilon * exp(-n * decay), where n is the number of updates the policy has received so far; with the default decay of 0, epsilon stays constant. A burn-in phase ensures each arm is tried a minimum number of times before the epsilon-greedy rule applies, which mitigates selection bias from early, noisy estimates.
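The selection rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not River's implementation: the class name `TinyEpsilonGreedy`, its attributes, and the use of a running mean per arm are all hypothetical choices made for the example, and the burn-in phase is left out.

```python
import math
import random
from collections import defaultdict


class TinyEpsilonGreedy:
    """Illustrative epsilon-greedy policy with exponential epsilon decay.

    Hypothetical helper for this page; it is not part of River.
    """

    def __init__(self, epsilon, decay=0.0, seed=None):
        self.epsilon = epsilon
        self.decay = decay
        self.n = 0  # number of updates seen so far
        self.counts = defaultdict(int)   # pulls per arm
        self.means = defaultdict(float)  # running mean reward per arm
        self.rng = random.Random(seed)

    @property
    def current_epsilon(self):
        # epsilon * exp(-n * decay); with decay == 0 this is just epsilon
        return self.epsilon * math.exp(-self.n * self.decay)

    def pull(self, arm_ids):
        arm_ids = list(arm_ids)
        if self.rng.random() < self.current_epsilon:
            return self.rng.choice(arm_ids)  # explore: uniform random arm
        # exploit: arm with the highest mean reward so far
        return max(arm_ids, key=lambda arm_id: self.means[arm_id])

    def update(self, arm_id, reward):
        self.n += 1
        self.counts[arm_id] += 1
        # incremental update of the running mean
        self.means[arm_id] += (reward - self.means[arm_id]) / self.counts[arm_id]
```

Keeping the decay inside a property means the exploration rate is always derived from the update count, so there is no separate schedule state to keep in sync.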
Usage
Use EpsilonGreedy when you need a simple, interpretable bandit algorithm with controllable exploration. Values of epsilon between 0.05 and 0.2 are a reasonable starting point for practical applications; tune the value for your problem. Set the decay parameter if you want exploration to decrease over time as the reward estimates become more reliable.
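To get a feel for how quickly exploration fades for a given decay, the formula epsilon * exp(-n * decay) can be evaluated directly. The helper names `decayed_epsilon` and `updates_until` below are hypothetical and exist only for this sketch; the settings epsilon=0.9 and decay=0.001 are the ones used in the usage example on this page.

```python
import math


def decayed_epsilon(epsilon, decay, n):
    """Epsilon after n updates, following epsilon * exp(-n * decay)."""
    return epsilon * math.exp(-n * decay)


def updates_until(epsilon, decay, target):
    """Smallest whole number of updates after which epsilon is at or below target.

    Solves epsilon * exp(-n * decay) <= target for n.
    Assumes decay > 0 and 0 < target <= epsilon.
    """
    return math.ceil(math.log(epsilon / target) / decay)


# Print the exploration rate at a few points in time.
for n in (0, 100, 1000, 5000):
    print(n, round(decayed_epsilon(0.9, 0.001, n), 4))

# Number of updates needed before epsilon drops to 0.1 or below.
print(updates_until(0.9, 0.001, 0.1))
```

With these settings epsilon reaches 0.1 after roughly 2,200 updates (ln(9) / 0.001), which is worth comparing against the expected length of the data stream before choosing a decay rate.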
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/bandit/epsilon_greedy.py
Signature
class EpsilonGreedy(bandit.base.Policy):
    def __init__(
        self,
        epsilon: float,
        decay=0.0,
        reward_obj=None,
        burn_in=0,
        seed: int | None = None,
    ):
        ...

    @property
    def current_epsilon(self):
        """The value of epsilon after factoring in the decay rate."""
        ...
Import
from river import bandit
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| epsilon | float | Exploration probability (0 to 1) |
| decay | float (default: 0.0) | Exponential decay rate for epsilon |
| reward_obj | RewardObj (optional) | Reward statistic (defaults to stats.Mean()) |
| burn_in | int (default: 0) | Minimum pulls per arm before strategy applies |
| seed | int (optional) | Random seed for reproducibility |
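
The `burn_in` parameter can be pictured as a guard that runs before the epsilon-greedy rule. The sketch below is one possible reading of "minimum pulls per arm before strategy applies"; the function name `pull_with_burn_in`, the `counts` dictionary, and the `fallback` callable are hypothetical and may not match how River handles burn-in internally.

```python
def pull_with_burn_in(arm_ids, counts, burn_in, fallback):
    """Return an under-sampled arm if one exists, else defer to the policy.

    arm_ids:  iterable of arm identifiers
    counts:   dict mapping arm id -> number of pulls so far
    burn_in:  minimum number of pulls required for every arm
    fallback: callable implementing the actual policy (e.g. epsilon-greedy)
    """
    arm_ids = list(arm_ids)
    for arm_id in arm_ids:
        if counts.get(arm_id, 0) < burn_in:
            return arm_id  # still in burn-in: force this arm
    return fallback(arm_ids)


# Simulate six pulls over three arms with burn_in=2.
# The fallback here is a stand-in that always picks the last arm.
counts = {}
history = []
for _ in range(6):
    arm = pull_with_burn_in(range(3), counts, burn_in=2, fallback=lambda arms: arms[-1])
    counts[arm] = counts.get(arm, 0) + 1
    history.append(arm)

print(history)
```

In this sketch the fallback policy is never consulted during the first `burn_in * number_of_arms` pulls, so every arm has at least `burn_in` observations before greedy selection starts.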
Usage Examples
import gymnasium as gym
from river import bandit
from river import stats

env = gym.make('river_bandits/CandyCaneContest-v0')
_ = env.reset(seed=42)
_ = env.action_space.seed(123)

# High initial exploration with decay
policy = bandit.EpsilonGreedy(
    epsilon=0.9,
    decay=0.001,
    seed=101
)

metric = stats.Sum()
while True:
    arm = policy.pull(range(env.action_space.n))
    observation, reward, terminated, truncated, info = env.step(arm)
    policy.update(arm, reward)
    metric.update(reward)
    if terminated or truncated:
        break

print(metric)  # Sum of rewards over the episode; the exact value depends on the seeds, epsilon, and decay
print(policy.current_epsilon) # Decayed epsilon value