Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Online ml River Multi Armed Bandit Policies

From Leeroopedia


Knowledge Sources Bandit Algorithms Finite-time Analysis of the Multiarmed Bandit Problem
Domains Online_Learning Bandit_Algorithms Decision_Making
Last Updated 2026-02-08 18:00 GMT

Overview

Multi-armed bandit policies are sequential decision-making strategies that balance exploration (trying less-known actions to gather information) with exploitation (choosing the action currently believed to be best). They provide a principled framework for making decisions under uncertainty when feedback is limited to the chosen action's reward.

Description

The multi-armed bandit (MAB) problem is a foundational model in online decision-making. An agent repeatedly chooses one of K arms (actions) and receives a stochastic reward drawn from that arm's unknown distribution. The goal is to maximize cumulative reward, or equivalently, minimize regret -- the difference between the agent's total reward and that of the best fixed arm in hindsight.

Key policy families include:

  • Epsilon-greedy: With probability ϵ, explore a random arm; otherwise exploit the current best. Simple but often effective.
  • Upper Confidence Bound (UCB): Select the arm with the highest upper confidence bound on its mean reward. Balances optimism with statistical uncertainty.
  • Bayesian UCB: Uses a Bayesian posterior to compute upper confidence bounds, incorporating prior beliefs.
  • Thompson Sampling: Sample from the posterior distribution of each arm's reward and select the arm with the highest sample. Naturally balances exploration and exploitation.
  • Exp3: Designed for adversarial (non-stochastic) settings using exponential-weight updates. Robust when rewards are chosen by an adversary.
  • LinUCB: A contextual bandit algorithm that uses linear models to predict rewards as a function of context features.

Usage

Use multi-armed bandit policies when:

  • You must choose among a finite set of actions with uncertain outcomes.
  • Feedback is limited to the reward of the chosen action (partial or bandit feedback).
  • You need to balance exploration of new actions with exploitation of known good actions.
  • Applications include A/B testing, recommendation systems, clinical trials, and ad placement.

Theoretical Basis

Regret: The fundamental performance measure is cumulative regret after T rounds:

R(T) = T * mu* - sum_{t=1}^{T} mu_{A_t}

Where μ* is the mean reward of the best arm and At is the arm chosen at time t.

UCB1 (Auer et al., 2002):

A_t = argmax_a [ mu_hat_a + sqrt(2 * ln(t) / N_a) ]

Where μ^a is the empirical mean reward of arm a and Na is the number of times arm a has been pulled. Achieves O(KTlnT) regret.

Thompson Sampling: For each arm a with Beta posterior Beta(αa,βa):

theta_a ~ Beta(alpha_a, beta_a)
A_t = argmax_a theta_a

After observing reward rt: update αAt+=rt, βAt+=(1rt).

Exp3 (adversarial setting): Maintains weights wa for each arm and selects proportionally:

P(A_t = a) = (1 - gamma) * w_a / sum(w) + gamma / K

Achieves O(KTlnK) regret even against adversarial reward sequences.

LinUCB (contextual): For context xt and arm a:

A_t = argmax_a [ x_t^T * theta_hat_a + alpha * sqrt(x_t^T * A_a^{-1} * x_t) ]

Where θ^a is the ridge regression estimate and Aa is the design matrix.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment