Principle:Online ml River Online Model Selection

Knowledge Sources	Domains	Last Updated
Machine Learning Multi-Armed Bandits	Online_Learning, Model_Selection, Hyperparameter_Tuning	2026-02-08 18:00 GMT

Overview

Online model selection is the problem of choosing the best model or hyperparameter configuration from a set of candidates during streaming learning, where performance must be evaluated incrementally and the best candidate may change over time. Key approaches include bandit-based selection, greedy evaluation, and successive halving.

Description

In batch machine learning, model selection and hyperparameter tuning are typically performed via cross-validation over the full dataset. In online learning, the data arrives as a stream, and the model must be used and evaluated concurrently. Online model selection methods must balance the exploration-exploitation trade-off: exploring different candidate models to estimate their performance while exploiting the currently best-performing model for predictions.

Key approaches include:

Bandit-based selection: Frames model selection as a multi-armed bandit problem. Each candidate model is an "arm" that yields a reward (metric score) when pulled. Bandit algorithms like Upper Confidence Bound (UCB) or Epsilon-Greedy balance exploration (trying underexplored models) with exploitation (using the best-performing model). This provides theoretical regret guarantees.

Greedy selection: A simpler strategy that always uses the currently best-performing model for prediction. All candidate models are trained on every observation, but only the one with the best cumulative metric is used for predictions. This maximizes exploitation but may be slow to detect when a previously poor model becomes the best due to concept drift.

Successive halving: An efficient resource allocation strategy inspired by the Hyperband algorithm. Starting with many candidate models, it periodically eliminates the worst-performing half (or fraction), concentrating computational resources on the most promising candidates. In the online setting, "resources" are training observations rather than epochs.

Base infrastructure: Online model selection requires a common framework for wrapping candidate models, tracking their individual metrics, and switching between them for prediction. This includes maintaining separate copies of the metric for each candidate and managing the lifecycle of candidate models.

Usage

Use online model selection when:

You have multiple candidate models or hyperparameter configurations and want to identify the best one during streaming learning.
You need to adapt model choice over time in response to changing data distributions.
You want theoretical guarantees on the regret of your model selection strategy.
You want to efficiently allocate computational resources among many candidates.

Theoretical Basis

Multi-Armed Bandit Formulation

K candidate models (arms): m_1, ..., m_K
At each time step t:
    1. Select model m_i according to bandit policy
    2. Use m_i to predict on instance x_t
    3. Observe reward r_t (e.g., 1 if correct, 0 if incorrect)
    4. Update bandit statistics for m_i
    5. Train ALL models on (x_t, y_t)

UCB policy:
    Select m_i = argmax_i [ mean_reward_i + c * sqrt(log(t) / n_i) ]
    where n_i = number of times m_i has been selected
    and c = exploration parameter

The UCB bound ensures that the cumulative regret (loss from not always using the best model) grows at most O(sqrt(KT log T)) over T time steps.

Greedy Selection

At each time step t:
    1. best_model = argmax_i metric_i.get()
    2. Use best_model to predict on x_t
    3. For all models m_i:
        y_pred_i = m_i.predict(x_t)
        metric_i.update(y_t, y_pred_i)
        m_i.learn(x_t, y_t)

Successive Halving

Initialize: S_0 = all K candidate models
            budget B per round
For round r = 0, 1, 2, ...:
    n_r = |S_r|  (number of surviving models)
    Allocate B / n_r observations to each model in S_r
    Evaluate all models in S_r
    S_{r+1} = top half of S_r by performance
    Stop when |S_r| = 1

Successive halving reduces the total computation from O(KB) (evaluating all models on all data) to O(B log K), a significant efficiency gain when K is large.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment