Principle:Online ml River Multi Armed Bandit Policies

Knowledge Sources	Bandit Algorithms Finite-time Analysis of the Multiarmed Bandit Problem
Domains	Online_Learning Bandit_Algorithms Decision_Making
Last Updated	2026-02-08 18:00 GMT

Overview

Multi-armed bandit policies are sequential decision-making strategies that balance exploration (trying less-known actions to gather information) with exploitation (choosing the action currently believed to be best). They provide a principled framework for making decisions under uncertainty when feedback is limited to the chosen action's reward.

Description

The multi-armed bandit (MAB) problem is a foundational model in online decision-making. An agent repeatedly chooses one of $K$ arms (actions) and receives a stochastic reward drawn from that arm's unknown distribution. The goal is to maximize cumulative reward, or equivalently, minimize regret -- the difference between the agent's total reward and that of the best fixed arm in hindsight.

Key policy families include:

Epsilon-greedy: With probability $ϵ$ , explore a random arm; otherwise exploit the current best. Simple but often effective.
Upper Confidence Bound (UCB): Select the arm with the highest upper confidence bound on its mean reward. Balances optimism with statistical uncertainty.
Bayesian UCB: Uses a Bayesian posterior to compute upper confidence bounds, incorporating prior beliefs.
Thompson Sampling: Sample from the posterior distribution of each arm's reward and select the arm with the highest sample. Naturally balances exploration and exploitation.
Exp3: Designed for adversarial (non-stochastic) settings using exponential-weight updates. Robust when rewards are chosen by an adversary.
LinUCB: A contextual bandit algorithm that uses linear models to predict rewards as a function of context features.

Usage

Use multi-armed bandit policies when:

You must choose among a finite set of actions with uncertain outcomes.
Feedback is limited to the reward of the chosen action (partial or bandit feedback).
You need to balance exploration of new actions with exploitation of known good actions.
Applications include A/B testing, recommendation systems, clinical trials, and ad placement.

Theoretical Basis

Regret: The fundamental performance measure is cumulative regret after $T$ rounds:

R(T) = T * mu* - sum_{t=1}^{T} mu_{A_t}

Where $μ^{*}$ is the mean reward of the best arm and $A_{t}$ is the arm chosen at time $t$ .

UCB1 (Auer et al., 2002):

A_t = argmax_a [ mu_hat_a + sqrt(2 * ln(t) / N_a) ]

Where ${\hat{μ}}_{a}$ is the empirical mean reward of arm $a$ and $N_{a}$ is the number of times arm $a$ has been pulled. Achieves $O (\sqrt{K T \ln T})$ regret.

Thompson Sampling: For each arm $a$ with Beta posterior $Beta (α_{a}, β_{a})$ :

theta_a ~ Beta(alpha_a, beta_a)
A_t = argmax_a theta_a

After observing reward $r_{t}$ : update $α_{A_{t}} + = r_{t}$ , $β_{A_{t}} + = (1 - r_{t})$ .

Exp3 (adversarial setting): Maintains weights $w_{a}$ for each arm and selects proportionally:

P(A_t = a) = (1 - gamma) * w_a / sum(w) + gamma / K

Achieves $O (\sqrt{K T \ln K})$ regret even against adversarial reward sequences.

LinUCB (contextual): For context $x_{t}$ and arm $a$ :

A_t = argmax_a [ x_t^T * theta_hat_a + alpha * sqrt(x_t^T * A_a^{-1} * x_t) ]

Where ${\hat{θ}}_{a}$ is the ridge regression estimate and $A_{a}$ is the design matrix.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment