Implementation:Online ml River Bandit EpsilonGreedy

From Leeroopedia


Knowledge Sources
Domains: Online_Learning, Multi_Armed_Bandits, Exploration_Exploitation
Last Updated: 2026-02-08 16:00 GMT

Overview

A simple yet effective bandit policy that balances exploration and exploitation using epsilon-greedy action selection with optional decay.

Description

EpsilonGreedy selects the arm with the best reward estimate with probability (1 - epsilon) and explores by choosing an arm uniformly at random with probability epsilon. An optional decay parameter reduces epsilon exponentially over time, which allows more exploration early on and more exploitation later. The current epsilon value is computed as epsilon * exp(-n * decay), where n is the number of updates the policy has received so far; with the default decay of 0.0, epsilon stays constant. A burn-in phase ensures that each arm is tried a minimum number of times before the epsilon-greedy rule applies, which mitigates selection bias.
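The selection rule described above can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not River's actual source: the class name SketchEpsilonGreedy and its private attributes are hypothetical, rewards are ranked by a simple average, and burn-in is assumed to mean "pull any arm that has fewer than burn_in recorded pulls first".

```python
import math
import random
from collections import defaultdict


class SketchEpsilonGreedy:
    """Illustrative epsilon-greedy policy (hypothetical, not River's code)."""

    def __init__(self, epsilon, decay=0.0, burn_in=0, seed=None):
        self.epsilon = epsilon
        self.decay = decay
        self.burn_in = burn_in
        self._rng = random.Random(seed)
        self._n = 0                        # total number of updates received
        self._counts = defaultdict(int)    # recorded pulls per arm
        self._totals = defaultdict(float)  # summed reward per arm

    @property
    def current_epsilon(self):
        # epsilon * exp(-n * decay); equals epsilon when decay is 0
        return self.epsilon * math.exp(-self._n * self.decay)

    def _mean(self, arm):
        count = self._counts[arm]
        return self._totals[arm] / count if count else 0.0

    def pull(self, arm_ids):
        arm_ids = list(arm_ids)
        # Assumed burn-in rule: under-sampled arms are pulled first.
        for arm in arm_ids:
            if self._counts[arm] < self.burn_in:
                return arm
        if self._rng.random() < self.current_epsilon:
            return self._rng.choice(arm_ids)   # explore
        return max(arm_ids, key=self._mean)    # exploit

    def update(self, arm, reward):
        self._n += 1
        self._counts[arm] += 1
        self._totals[arm] += reward


# With epsilon=0 the sketch always exploits once burn-in is satisfied.
policy = SketchEpsilonGreedy(epsilon=0.0, burn_in=1, seed=42)
for arm, reward in [(0, 0.1), (1, 0.9), (2, 0.5)]:
    policy.update(arm, reward)
best = policy.pull(range(3))
```

In this sketch, pull never changes state; only update advances n, so epsilon decays with the number of observed rewards rather than the number of pulls.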

Usage

Use EpsilonGreedy when you need a simple, interpretable bandit algorithm with controllable exploration. As a rule of thumb, an epsilon between 0.05 and 0.2 is a reasonable starting point for practical applications; tune it to the problem at hand. Set the decay parameter if you want exploration to decrease over time as the reward estimates become more reliable.
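When choosing a decay rate, it can help to work out how quickly epsilon shrinks under the formula epsilon * exp(-n * decay). The helpers below are a small sketch for that arithmetic; the function names decayed_epsilon and steps_to_reach are hypothetical and are not part of River.

```python
import math


def decayed_epsilon(epsilon, decay, n):
    """Epsilon after n updates, following epsilon * exp(-n * decay)."""
    return epsilon * math.exp(-n * decay)


def steps_to_reach(epsilon, decay, target):
    """Smallest number of updates after which epsilon is at or below target."""
    if decay <= 0 or not 0 < target <= epsilon:
        raise ValueError("need decay > 0 and 0 < target <= epsilon")
    return math.ceil(math.log(epsilon / target) / decay)


# With decay=0.001, epsilon halves roughly every ln(2) / 0.001 updates.
half_life = math.log(2) / 0.001

# Updates needed for epsilon=0.9 to fall to 0.1 with decay=0.001.
n_needed = steps_to_reach(0.9, 0.001, 0.1)
```

The half-life depends only on decay, not on the starting epsilon, so a decay rate can be picked from the number of updates over which exploration should fade.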

Code Reference

Source Location

Signature

class EpsilonGreedy(bandit.base.Policy):
    def __init__(
        self,
        epsilon: float,
        decay=0.0,
        reward_obj=None,
        burn_in=0,
        seed: int | None = None,
    ):
        ...

    @property
    def current_epsilon(self):
        """The value of epsilon after factoring in the decay rate."""
        ...

Import

from river import bandit

I/O Contract

Parameter | Type | Description
epsilon | float | Exploration probability (0 to 1)
decay | float (default: 0.0) | Exponential decay rate for epsilon
reward_obj | RewardObj (optional) | Reward statistic (defaults to stats.Mean())
burn_in | int (default: 0) | Minimum pulls per arm before strategy applies
seed | int (optional) | Random seed for reproducibility
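To make the role of the reward statistic concrete, the sketch below keeps one running mean per arm and ranks arms by it, which is the exploit step in miniature. RunningMean is a hypothetical stand-in written for this example; it only assumes that a reward statistic exposes update() and get() methods.

```python
class RunningMean:
    """Hypothetical stand-in for a reward statistic with update() and get()."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental mean: shift the estimate toward x by 1/n of the gap.
        self.n += 1
        self.mean += (x - self.mean) / self.n

    def get(self):
        return self.mean


# One statistic per arm, updated only with that arm's rewards.
rewards = {"a": RunningMean(), "b": RunningMean()}
for arm, reward in [("a", 1.0), ("b", 0.0), ("a", 0.0), ("b", 1.0), ("b", 1.0)]:
    rewards[arm].update(reward)

# The greedy choice is the arm whose statistic currently reports the highest value.
best_arm = max(rewards, key=lambda arm: rewards[arm].get())
```

Swapping in a different statistic changes what "best-performing" means for the greedy step without touching the exploration logic.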

Usage Examples

import gymnasium as gym
from river import bandit
from river import stats

env = gym.make('river_bandits/CandyCaneContest-v0')
_ = env.reset(seed=42)
_ = env.action_space.seed(123)

# High initial exploration with decay
policy = bandit.EpsilonGreedy(
    epsilon=0.9,
    decay=0.001,
    seed=101
)

metric = stats.Sum()
while True:
    arm = policy.pull(range(env.action_space.n))
    observation, reward, terminated, truncated, info = env.step(arm)
    policy.update(arm, reward)
    metric.update(reward)
    if terminated or truncated:
        break

print(metric)  # Total reward collected over the episode
print(policy.current_epsilon)  # epsilon * exp(-n * decay) after n updates
