Principle:Online ml River Bandit Datasets
| Knowledge Sources | Bandit Algorithms, A Contextual-Bandit Approach to Personalized News Article Recommendation |
|---|---|
| Domains | Online_Learning, Bandit_Algorithms, Benchmarking |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Bandit benchmark datasets are curated collections of contextual decision-making data used to evaluate and compare bandit policies under reproducible conditions. These datasets capture the structure of real-world sequential decision problems where an agent selects actions based on context and receives partial reward feedback.
Description
Benchmarking bandit algorithms requires datasets that reflect the unique structure of the bandit problem: contexts (features), available actions, and rewards. Unlike standard supervised learning datasets, bandit datasets must account for the fact that reward is only observed for the chosen action.
Two main categories of bandit datasets exist:
- Logged bandit data: Data collected from a deployed policy, where each record contains the context, the action taken, the reward received, and (ideally) the probability of the action under the logging policy. Examples include news article recommendation logs.
- Supervised-to-bandit conversion: Standard classification datasets can be converted to bandit format by treating each class as an arm and assigning reward 1 if the chosen arm matches the true label, 0 otherwise.
Key properties of good bandit benchmark datasets:
- Realistic context: Feature representations that reflect actual decision-making scenarios.
- Multiple arms: A meaningful number of actions to choose from.
- Non-trivial reward structure: Rewards that vary across arms and contexts, requiring genuine exploration.
- Scale: Sufficient data for statistically meaningful evaluation.
Usage
Use bandit benchmark datasets when:
- You need to compare bandit policies under controlled conditions.
- You want to validate a new bandit algorithm before live deployment.
- You need reproducible experimental results for research or development.
- You want to test how policies handle varying numbers of arms and context dimensions.
Theoretical Basis
Supervised-to-bandit conversion: Given a classification dataset with $K$ classes and labeled examples (x_i, y_i), the bandit version provides, at each step i:
context: x_i
arms: {1, 2, ..., K}
reward(a): 1 if a = y_i, else 0
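The conversion above can be sketched in a few lines of Python. This is a minimal illustration, not River's own dataset API: the names `bandit_reward` and `simulate` are hypothetical, and the policy is a simple context-free epsilon-greedy chosen only to show that the learner sees the reward of the arm it picked and never the true label.

```python
import random


def bandit_reward(chosen_arm, true_label):
    """Reward 1 if the chosen arm equals the true label, else 0."""
    return 1 if chosen_arm == true_label else 0


def simulate(samples, arms, epsilon=0.1, seed=0):
    """Run a context-free epsilon-greedy policy over (x, y) pairs.

    The label y is used only to compute the reward of the chosen arm;
    the policy itself never observes y (partial feedback).
    """
    rng = random.Random(seed)
    pulls = {a: 0 for a in arms}
    wins = {a: 0 for a in arms}
    total = 0
    for _x, y in samples:
        explore = rng.random() < epsilon or all(n == 0 for n in pulls.values())
        if explore:
            arm = rng.choice(arms)
        else:
            # Exploit: pick the arm with the best empirical reward rate so far.
            arm = max(arms, key=lambda a: wins[a] / pulls[a] if pulls[a] else 0.0)
        r = bandit_reward(arm, y)
        pulls[arm] += 1
        wins[arm] += r
        total += r
    return total, pulls
```

Calling `simulate(samples, arms=[1, 2, 3])` on any iterable of `(features, label)` pairs returns the cumulative reward and the per-arm pull counts. A contextual policy would replace the arm-selection step with one that uses `_x`.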
Logged data format: Each record in logged bandit data contains:
(context, action, reward, propensity)
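The propensity is the probability that the logging policy assigned to the logged action in that context. It is essential for unbiased off-policy evaluation, because it allows each logged reward to be reweighted toward the target policy's action distribution.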
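One common way to use the propensity is inverse propensity scoring (IPS). The sketch below assumes records shaped exactly like the tuple above and a hypothetical `target_prob(context, action)` callable that returns the probability the target policy would choose `action`. It is an illustration of the estimator, not a function provided by any particular library.

```python
def ips_estimate(logged, target_prob):
    """Inverse propensity scoring estimate of a target policy's mean reward.

    logged: iterable of (context, action, reward, propensity) records.
    target_prob: callable (context, action) -> probability under the target policy.
    """
    total = 0.0
    n = 0
    for context, action, reward, propensity in logged:
        if propensity <= 0:
            raise ValueError("propensity must be positive for every logged action")
        # Reweight the logged reward by how much more (or less) likely the
        # target policy is to take this action than the logging policy was.
        total += reward * target_prob(context, action) / propensity
        n += 1
    if n == 0:
        raise ValueError("no logged records")
    return total / n
```

For example, with two arms logged uniformly (propensity 0.5) and a target policy that always picks arm 0, records for arm 1 contribute zero and records for arm 0 are weighted by 2, so the average recovers the mean reward of arm 0.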
Sample complexity: The number of logged samples needed for reliable evaluation scales as $O(K/\epsilon^2)$ for $K$ arms and desired precision $\epsilon$, because only a fraction $1/K$ of logged samples match any given target policy's decisions (under uniform logging).
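The matching argument can be checked with a small replay-style evaluation. The sketch below assumes a log collected by choosing arms uniformly at random and a deterministic target policy. The helper names `make_uniform_log` and `replay_evaluate` are hypothetical, and the synthetic log is only there to make the example runnable.

```python
import random


def make_uniform_log(n, k, best_arm=0, seed=0):
    """Synthetic log: arms drawn uniformly from range(k); reward 1 only for best_arm."""
    rng = random.Random(seed)
    log = []
    for i in range(n):
        action = rng.randrange(k)
        reward = 1.0 if action == best_arm else 0.0
        log.append(({"step": i}, action, reward))
    return log


def replay_evaluate(logged, policy):
    """Keep only records where the target policy agrees with the logged action.

    Returns (mean reward on matched records, number matched, number logged).
    """
    matched = 0
    total_reward = 0.0
    n = 0
    for context, action, reward in logged:
        n += 1
        if policy(context) == action:
            matched += 1
            total_reward += reward
    mean = total_reward / matched if matched else float("nan")
    return mean, matched, n
```

With `k` arms, roughly `n / k` records survive the matching step, so the standard error of the estimate shrinks with the matched count rather than with `n`. Reaching a fixed precision therefore needs about `k` times as many logged records as matched ones.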