Principle:Online ml River Multi Armed Bandit Policies
| Knowledge Sources | Bandit Algorithms Finite-time Analysis of the Multiarmed Bandit Problem |
|---|---|
| Domains | Online_Learning Bandit_Algorithms Decision_Making |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Multi-armed bandit policies are sequential decision-making strategies that balance exploration (trying less-known actions to gather information) with exploitation (choosing the action currently believed to be best). They provide a principled framework for making decisions under uncertainty when feedback is limited to the chosen action's reward.
Description
The multi-armed bandit (MAB) problem is a foundational model in online decision-making. An agent repeatedly chooses one of arms (actions) and receives a stochastic reward drawn from that arm's unknown distribution. The goal is to maximize cumulative reward, or equivalently, minimize regret -- the difference between the agent's total reward and that of the best fixed arm in hindsight.
Key policy families include:
- Epsilon-greedy: With probability , explore a random arm; otherwise exploit the current best. Simple but often effective.
- Upper Confidence Bound (UCB): Select the arm with the highest upper confidence bound on its mean reward. Balances optimism with statistical uncertainty.
- Bayesian UCB: Uses a Bayesian posterior to compute upper confidence bounds, incorporating prior beliefs.
- Thompson Sampling: Sample from the posterior distribution of each arm's reward and select the arm with the highest sample. Naturally balances exploration and exploitation.
- Exp3: Designed for adversarial (non-stochastic) settings using exponential-weight updates. Robust when rewards are chosen by an adversary.
- LinUCB: A contextual bandit algorithm that uses linear models to predict rewards as a function of context features.
Usage
Use multi-armed bandit policies when:
- You must choose among a finite set of actions with uncertain outcomes.
- Feedback is limited to the reward of the chosen action (partial or bandit feedback).
- You need to balance exploration of new actions with exploitation of known good actions.
- Applications include A/B testing, recommendation systems, clinical trials, and ad placement.
Theoretical Basis
Regret: The fundamental performance measure is cumulative regret after rounds:
R(T) = T * mu* - sum_{t=1}^{T} mu_{A_t}
Where is the mean reward of the best arm and is the arm chosen at time .
UCB1 (Auer et al., 2002):
A_t = argmax_a [ mu_hat_a + sqrt(2 * ln(t) / N_a) ]
Where is the empirical mean reward of arm and is the number of times arm has been pulled. Achieves regret.
Thompson Sampling: For each arm with Beta posterior :
theta_a ~ Beta(alpha_a, beta_a)
A_t = argmax_a theta_a
After observing reward : update , .
Exp3 (adversarial setting): Maintains weights for each arm and selects proportionally:
P(A_t = a) = (1 - gamma) * w_a / sum(w) + gamma / K
Achieves regret even against adversarial reward sequences.
LinUCB (contextual): For context and arm :
A_t = argmax_a [ x_t^T * theta_hat_a + alpha * sqrt(x_t^T * A_a^{-1} * x_t) ]
Where is the ridge regression estimate and is the design matrix.
Related Pages
- Implementation:Online_ml_River_Bandit_Policy_Base
- Implementation:Online_ml_River_Bandit_BayesUCB
- Implementation:Online_ml_River_Bandit_EpsilonGreedy
- Implementation:Online_ml_River_Bandit_Exp3
- Implementation:Online_ml_River_Bandit_LinUCBDisjoint
- Implementation:Online_ml_River_Bandit_RandomPolicy
- Implementation:Online_ml_River_Bandit_ThompsonSampling
- Implementation:Online_ml_River_Bandit_UCB
- Principle:Online_ml_River_Bandit_Evaluation
- Principle:Online_ml_River_Bandit_Environments