Principle:Online ml River Bandit Datasets

Knowledge Sources	Bandit Algorithms; A Contextual-Bandit Approach to Personalized News Article Recommendation
Domains	Online_Learning Bandit_Algorithms Benchmarking
Last Updated	2026-02-08 18:00 GMT

Overview

Bandit benchmark datasets are curated collections of contextual decision-making data used to evaluate and compare bandit policies under reproducible conditions. These datasets capture the structure of real-world sequential decision problems where an agent selects actions based on context and receives partial reward feedback.

Description

Benchmarking bandit algorithms requires datasets that reflect the unique structure of the bandit problem: contexts (features), available actions, and rewards. Unlike standard supervised learning datasets, bandit datasets must account for the fact that reward is only observed for the chosen action.

Two main categories of bandit datasets exist:

Logged bandit data: Data collected from a deployed policy, where each record contains the context, the action taken, the reward received, and (ideally) the probability of the action under the logging policy. Examples include news article recommendation logs.
Supervised-to-bandit conversion: Standard classification datasets can be converted to bandit format by treating each class as an arm and assigning reward 1 if the chosen arm matches the true label, 0 otherwise.

Key properties of good bandit benchmark datasets:

Realistic context: Feature representations that reflect actual decision-making scenarios.
Multiple arms: A meaningful number of actions to choose from.
Non-trivial reward structure: Rewards that vary across arms and contexts, requiring genuine exploration.
Scale: Sufficient data for statistically meaningful evaluation.

Usage

Use bandit benchmark datasets when:

You need to compare bandit policies under controlled conditions.
You want to validate a new bandit algorithm before live deployment.
You need reproducible experimental results for research or development.
You want to test how policies handle varying numbers of arms and context dimensions.

Theoretical Basis

Supervised-to-bandit conversion: Given a classification dataset $(x_{i}, y_{i})$ with $K$ classes, the bandit version provides:

context: x_i
arms: {1, 2, ..., K}
reward(a): 1 if a = y_i, else 0
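
The conversion above can be sketched in a few lines of Python. This is a minimal illustration rather than any library's API: the names `bandit_reward`, `average_reward`, `always_first_arm`, and the toy data are all hypothetical. The point it shows is that the policy never sees the label, and only the reward of the arm it chose is revealed.

```python
from typing import Callable, Hashable, Iterable, Sequence, Tuple

Context = Tuple[float, ...]


def bandit_reward(chosen_arm: Hashable, true_label: Hashable) -> int:
    """Reward is 1 if the chosen arm equals the true label, else 0."""
    return 1 if chosen_arm == true_label else 0


def average_reward(
    dataset: Iterable[Tuple[Context, Hashable]],
    arms: Sequence[Hashable],
    policy: Callable[[Context, Sequence[Hashable]], Hashable],
) -> float:
    """Replay a classification dataset as a bandit problem.

    The policy sees only the context and the arm set. The label y is
    used solely to compute the reward of the arm that was chosen.
    """
    total, n = 0, 0
    for x, y in dataset:
        arm = policy(x, arms)           # policy acts without seeing y
        total += bandit_reward(arm, y)  # partial feedback: one arm only
        n += 1
    return total / n if n else 0.0


# Hypothetical toy data: (context, true class) pairs with K = 3 classes.
toy_dataset = [((0.1, 0.9), 2), ((0.8, 0.2), 1), ((0.5, 0.5), 2)]
toy_arms = [1, 2, 3]


def always_first_arm(context: Context, arms: Sequence[Hashable]) -> Hashable:
    """Trivial baseline policy that ignores the context."""
    return arms[0]


print(average_reward(toy_dataset, toy_arms, always_first_arm))
```

A learning policy would also update its state after each reward. That step is omitted here to keep the sketch focused on the reward definition.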

Logged data format: Each record in logged bandit data contains:

(context, action, reward, propensity)

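Here, propensity $= P(\text{action} \mid \text{context})$ is the probability that the logging policy chose the recorded action in that context. It is essential for unbiased off-policy evaluation, because it lets inverse propensity weighting correct for the logging policy's preferences.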
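
To show how the propensity field is used, here is a minimal sketch of an inverse propensity scoring (IPS) estimator for a deterministic target policy. The `LoggedRecord` class, the `ips_estimate` function, and the toy log are hypothetical names introduced for illustration. They are not taken from any particular library.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class LoggedRecord:
    context: Tuple[float, ...]
    action: int
    reward: float
    propensity: float  # P(action | context) under the logging policy


def ips_estimate(
    records: Sequence[LoggedRecord],
    target_policy: Callable[[Tuple[float, ...]], int],
) -> float:
    """Estimate the average reward of a deterministic target policy.

    Records where the target policy agrees with the logged action
    contribute reward / propensity; all other records contribute 0.
    """
    if not records:
        raise ValueError("need at least one logged record")
    total = 0.0
    for r in records:
        if target_policy(r.context) == r.action:
            total += r.reward / r.propensity
    return total / len(records)


# Hypothetical log: two arms, uniform logging (propensity 0.5 each).
toy_log = [
    LoggedRecord((0.0,), 0, 1.0, 0.5),
    LoggedRecord((0.0,), 1, 0.0, 0.5),
    LoggedRecord((1.0,), 1, 1.0, 0.5),
    LoggedRecord((1.0,), 0, 0.0, 0.5),
]


def match_context(context: Tuple[float, ...]) -> int:
    """Target policy: pick the arm whose index equals the first feature."""
    return int(context[0])


print(ips_estimate(toy_log, match_context))
```

In this toy log the target policy is rewarded whenever it acts, and the two matching records are each weighted by 1 / 0.5, so the estimate recovers an average reward of 1.0.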

Sample complexity: Under uniform logging, the number of logged samples needed to evaluate a deterministic target policy reliably scales as $O(K / \epsilon^{2})$ for $K$ arms and desired precision $\epsilon$, because on average only a fraction $1/K$ of logged samples match the target policy's decisions.
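
The $1/K$ matching rate can be checked with a short simulation. The sketch below assumes uniform logging and a fixed target policy that always picks arm 0. The function names are hypothetical, and `records_needed` drops the constants hidden by the $O(\cdot)$ notation, so it gives only an order-of-magnitude guide.

```python
import math
import random


def matched_fraction(n_records: int, n_arms: int, seed: int = 0) -> float:
    """Fraction of uniformly logged actions that match a fixed target arm."""
    rng = random.Random(seed)
    target_arm = 0  # assumption: the target policy always picks arm 0
    matches = sum(
        1 for _ in range(n_records) if rng.randrange(n_arms) == target_arm
    )
    return matches / n_records


def records_needed(n_arms: int, epsilon: float) -> int:
    """Order-of-magnitude record count K / epsilon^2, constants ignored."""
    return math.ceil(n_arms / epsilon**2)


for k in (2, 4, 10):
    print(k, matched_fraction(20_000, k), records_needed(k, 0.5))
```

As the number of arms grows, the usable fraction of the log shrinks roughly in proportion, which is why the required log size grows linearly in $K$.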
