Implementation:Online ml River Bandit EpsilonGreedy

Knowledge Sources	Online_ml_River
Domains	Online_Learning, Multi_Armed_Bandits, Exploration_Exploitation
Last Updated	2026-02-08 16:00 GMT

Overview

A simple yet effective bandit policy that balances exploration and exploitation using epsilon-greedy action selection with optional decay.

Description

EpsilonGreedy exploits the arm with the highest estimated reward with probability 1 - epsilon and explores by choosing an arm uniformly at random with probability epsilon. An optional decay parameter shrinks epsilon exponentially over time, so the policy explores more early on and exploits more later: the current epsilon value is computed as epsilon * exp(-n * decay), where n is the number of updates the policy has received so far. With decay=0.0 (the default), epsilon stays constant. A burn-in phase ensures each arm is pulled a minimum number of times before the epsilon-greedy rule applies, which mitigates selection bias from early, noisy reward estimates.
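The selection rule described above can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not River's actual code: the class name EpsilonGreedySketch and its attributes (n, counts, means) are hypothetical, the reward statistic is fixed to a running mean, and arms with no observations are assumed to have an estimated reward of 0.

```python
import math
import random


class EpsilonGreedySketch:
    """Illustrative epsilon-greedy policy (hypothetical, not River's code)."""

    def __init__(self, epsilon, decay=0.0, burn_in=0, seed=None):
        self.epsilon = epsilon
        self.decay = decay
        self.burn_in = burn_in
        self.rng = random.Random(seed)
        self.n = 0        # total number of updates seen so far
        self.counts = {}  # number of pulls per arm
        self.means = {}   # running mean reward per arm

    @property
    def current_epsilon(self):
        # epsilon * exp(-n * decay); constant when decay is 0
        return self.epsilon * math.exp(-self.n * self.decay)

    def pull(self, arm_ids):
        arm_ids = list(arm_ids)
        # Burn-in: return any arm pulled fewer than `burn_in` times.
        for arm in arm_ids:
            if self.counts.get(arm, 0) < self.burn_in:
                return arm
        # Explore with probability current_epsilon, otherwise exploit.
        if self.rng.random() < self.current_epsilon:
            return self.rng.choice(arm_ids)
        return max(arm_ids, key=lambda arm: self.means.get(arm, 0.0))

    def update(self, arm, reward):
        self.n += 1
        count = self.counts.get(arm, 0) + 1
        self.counts[arm] = count
        mean = self.means.get(arm, 0.0)
        # Incremental update of the running mean reward
        self.means[arm] = mean + (reward - mean) / count
```

With epsilon=0.0 the sketch is purely greedy once burn-in is over, which makes the burn-in phase easy to observe in isolation.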

Usage

Use EpsilonGreedy when you need a simple, interpretable bandit algorithm with controllable exploration. Values of epsilon between 0.05 and 0.2 are a reasonable starting point for many practical applications; tune from there based on how noisy the rewards are. Set the decay parameter if you want exploration to decrease over time as the reward estimates become more reliable.
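To get a feel for how a given decay value plays out, it can help to tabulate epsilon * exp(-n * decay) for a few step counts. The sketch below does this for the formula given in the Description; the helper names decayed_epsilon and half_life are hypothetical and not part of River. The half-life, ln(2) / decay, is the number of updates after which epsilon has dropped to half its initial value.

```python
import math


def decayed_epsilon(epsilon, decay, n):
    """Epsilon after n updates under exponential decay."""
    return epsilon * math.exp(-n * decay)


def half_life(decay):
    """Number of updates after which epsilon is halved (requires decay > 0)."""
    return math.log(2) / decay


# Schedule for epsilon=0.9, decay=0.001
for n in (0, 100, 1000, 5000):
    print(n, round(decayed_epsilon(0.9, 0.001, n), 4))

print("half-life:", round(half_life(0.001)), "updates")
```

For example, with decay=0.001 the half-life is roughly 693 updates, so an initial epsilon of 0.9 falls to about 0.33 after 1000 updates.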

Code Reference

Source Location

Repository: Online_ml_River
File: river/bandit/epsilon_greedy.py

Signature

class EpsilonGreedy(bandit.base.Policy):
    def __init__(
        self,
        epsilon: float,
        decay=0.0,
        reward_obj=None,
        burn_in=0,
        seed: int | None = None,
    ):
        ...

    @property
    def current_epsilon(self):
        """The value of epsilon after factoring in the decay rate."""
        ...

Import

from river import bandit

I/O Contract

Parameter	Type	Description
epsilon	float	Exploration probability (0 to 1)
decay	float (default: 0.0)	Exponential decay rate for epsilon
reward_obj	RewardObj (optional)	Reward statistic (defaults to stats.Mean())
burn_in	int (default: 0)	Minimum pulls per arm before strategy applies
seed	int (optional)	Random seed for reproducibility

Usage Examples

import gymnasium as gym
from river import bandit
from river import stats

env = gym.make('river_bandits/CandyCaneContest-v0')
_ = env.reset(seed=42)
_ = env.action_space.seed(123)

# High initial exploration with decay
policy = bandit.EpsilonGreedy(
    epsilon=0.9,
    decay=0.001,
    seed=101
)

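metric = stats.Sum()  # running total of rewards
while True:
    # Choose an arm among all available actions
    arm = policy.pull(range(env.action_space.n))
    observation, reward, terminated, truncated, info = env.step(arm)
    # Feed the observed reward back to the policy
    policy.update(arm, reward)
    metric.update(reward)
    if terminated or truncated:
        break

print(metric)  # total reward collected over the episode
print(policy.current_epsilon)  # epsilon * exp(-n * decay) after n updates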
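If the gymnasium environment is not available, the same pull/update loop can be tried on a simulated bandit. The following is a dependency-free sketch, assuming a hypothetical Bernoulli bandit whose arms pay 1 with fixed probabilities; the function run_epsilon_greedy is illustrative and does not use River.

```python
import math
import random


def run_epsilon_greedy(probs, steps, epsilon, decay=0.0, seed=0):
    """Run epsilon-greedy on a simulated Bernoulli bandit.

    probs[i] is the (hypothetical) success probability of arm i.
    Returns the total reward and the number of pulls per arm.
    """
    rng = random.Random(seed)
    counts = [0] * len(probs)
    means = [0.0] * len(probs)
    total = 0.0
    for n in range(steps):
        eps = epsilon * math.exp(-n * decay)
        if rng.random() < eps:
            arm = rng.randrange(len(probs))  # explore
        else:
            arm = max(range(len(probs)), key=means.__getitem__)  # exploit
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return total, counts


total, counts = run_epsilon_greedy([0.2, 0.5, 0.8], steps=2000, epsilon=0.1)
print("total reward:", total)
print("pulls per arm:", counts)
```

Setting epsilon=0.0 in this sketch makes the policy lock onto the first arm and never try the others, which illustrates why some exploration (or a burn-in phase) is needed.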
