Principle:Online ml River Bandit Evaluation

Knowledge Sources	Bandit Algorithms; Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation
Domains	Online_Learning Bandit_Algorithms Evaluation
Last Updated	2026-02-08 18:00 GMT

Overview

Bandit policy evaluation is the methodology for assessing how well a bandit policy performs at selecting actions and accumulating rewards over time. Unlike standard supervised evaluation, bandit evaluation must account for partial feedback (only the chosen arm's reward is observed) and the sequential, adaptive nature of the decision process.

Description

Evaluating bandit policies presents unique challenges compared to standard classification or regression evaluation:

Partial feedback: At each round, only the reward for the chosen arm is observed, not the rewards for unchosen arms.
Counterfactual reasoning: Evaluating a new policy using historical data (collected by a different policy) requires off-policy evaluation techniques.
Non-stationarity: Reward distributions may change over time, making historical performance an imperfect predictor of future performance.

Common evaluation approaches include:

Online evaluation: Deploy the policy in a live environment and measure cumulative reward directly.
Progressive validation: Analogous to prequential evaluation in supervised learning: the policy selects an arm first, then observes the reward and updates.
Replay method: Use logged data from a randomized policy to simulate what would have happened under the target policy.
Regret tracking: Measure cumulative regret against the best fixed policy (or the best adaptive policy in non-stationary settings).

Usage

Use bandit policy evaluation when:

You need to compare multiple bandit strategies on the same problem.
You want to monitor a deployed bandit policy's performance in real time.
You need to estimate how a new policy would perform using historical logged data.
You want to detect when a policy's performance is degrading.

Theoretical Basis

Cumulative reward:

G(T) = sum_{t=1}^{T} r_t

Cumulative regret:

R(T) = sum_{t=1}^{T} (mu* - r_t)

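where $μ^{*}$ is the expected reward of the optimal arm. Because $r_{t}$ is the realized reward rather than its expectation, individual terms can be negative; the sum is non-negative only in expectation.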
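As an illustration of the two formulas above, the following minimal sketch computes the running curves G(1..T) and R(1..T) from a list of observed rewards. The function names, the sample rewards, and the value of `mu_star` are assumptions made for this example; in practice the optimal arm's expected reward is only known in simulation.

```python
from itertools import accumulate


def cumulative_reward(rewards):
    """Return [G(1), ..., G(T)] for a sequence of observed rewards r_t."""
    return list(accumulate(rewards))


def cumulative_regret(rewards, mu_star):
    """Return [R(1), ..., R(T)], where R(T) = sum_t (mu_star - r_t)."""
    # Uses realized rewards, so a single term may be negative.
    return list(accumulate(mu_star - r for r in rewards))


# Hypothetical rewards from five rounds; mu_star is assumed known (simulation only).
rewards = [1, 0, 1, 1, 0]
mu_star = 0.8

reward_curve = cumulative_reward(rewards)
regret_curve = cumulative_regret(rewards, mu_star)
```

Keeping the whole curve rather than only the final value makes it possible to plot regret against time and check whether it grows sublinearly.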

Replay method (Li et al., 2011): Given logged data $(x_{t}, a_{t}, r_{t})$ collected by a logging policy $π_{0}$, estimate the value of a target policy $π$:

V_hat(pi) = (1/|S|) * sum_{t in S} r_t
where S = {t : pi(x_t) = a_t}

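This estimate is unbiased when the logging policy chooses arms uniformly at random. For non-uniform logging policies, inverse propensity scoring (IPS) reweights each matching round by the inverse of the probability with which the logging policy chose the logged arm, which requires that probability to be positive for every arm the target policy can select: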

V_hat_IPS(pi) = (1/T) * sum_{t=1}^{T} r_t * I(pi(x_t) = a_t) / pi_0(a_t | x_t)
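The replay and IPS estimators above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a library implementation: the target policy is a fixed deterministic function of the context, logs are `(context, arm, reward)` tuples, and the names `replay_estimate`, `ips_estimate`, and `propensities` are hypothetical.

```python
def replay_estimate(logs, target_policy):
    """Mean reward over logged rounds where the target policy picks the logged arm."""
    matched = [r for (x, a, r) in logs if target_policy(x) == a]
    if not matched:
        return None  # S is empty, so the estimate is undefined
    return sum(matched) / len(matched)


def ips_estimate(logs, target_policy, propensities):
    """IPS estimate; propensities[t] is pi_0(a_t | x_t) for logged round t."""
    total = 0.0
    for (x, a, r), p in zip(logs, propensities):
        if target_policy(x) == a:  # indicator I(pi(x_t) = a_t)
            total += r / p
    return total / len(logs)


# Hypothetical log from a logging policy that picks arm "A" or "B" with probability 0.5.
logs = [("u1", "A", 1.0), ("u2", "B", 0.0), ("u1", "B", 1.0), ("u2", "A", 0.0)]
propensities = [0.5, 0.5, 0.5, 0.5]


def target_policy(x):
    """Made-up target policy: show "A" to user u1 and "B" to everyone else."""
    return "A" if x == "u1" else "B"


v_replay = replay_estimate(logs, target_policy)
v_ips = ips_estimate(logs, target_policy, propensities)
```

Replay discards every round where the two policies disagree, so the effective sample size is |S| rather than T. IPS keeps the 1/T normalization and can have high variance when some propensities are small.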

Progressive validation for bandits: At each time step, the policy selects an arm, observes the reward, and updates its internal state. The running average reward, computed from rewards earned by choices made before each update, is tracked as the performance metric.
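The loop described above might look like the following sketch, which runs a small epsilon-greedy policy against simulated Bernoulli arms. Everything here is an assumption for illustration: the `EpsilonGreedy` class, its `pull`/`update` methods, and the arm probabilities are hypothetical and use only the standard library, not any particular bandit package.

```python
import random


class EpsilonGreedy:
    """Hypothetical minimal policy, written only for this illustration."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.rng = random.Random(seed)
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms  # running mean reward per arm

    def pull(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.means))
        return max(range(len(self.means)), key=lambda a: self.means[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def progressive_validation(policy, arm_probs, horizon, seed=0):
    """Return the running average reward after each round."""
    env_rng = random.Random(seed)
    avg, trace = 0.0, []
    for t in range(1, horizon + 1):
        arm = policy.pull()                                          # select
        reward = 1.0 if env_rng.random() < arm_probs[arm] else 0.0   # observe
        avg += (reward - avg) / t                                    # track metric
        policy.update(arm, reward)                                   # learn
        trace.append(avg)
    return trace


trace = progressive_validation(
    EpsilonGreedy(n_arms=3, seed=1), arm_probs=[0.2, 0.5, 0.8], horizon=2000, seed=2
)
```

Because each reward comes from a choice made before the policy saw that reward, the running average is an honest measure of online performance and needs no separate held-out data.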
