Principle:Online ml River Dummy Baseline Estimation

Knowledge Sources	Domains	Last Updated
Machine Learning Statistics	Online_Learning, Evaluation, Benchmarking	2026-02-08 18:00 GMT

Overview

Dummy or baseline estimators are trivial predictive models that make predictions using simple statistical rules (such as always predicting the most frequent class or the mean of the target) without learning any relationship between features and targets. They establish a performance floor that any competent model must exceed.

Description

In machine learning, it is critical to establish baseline performance before evaluating complex models. A dummy estimator ignores the input features entirely and produces predictions based solely on simple statistics of the target variable. If a sophisticated model cannot outperform a dummy estimator, it is either not learning useful patterns from the data or something in the training or evaluation pipeline is broken.

Common strategies for dummy estimators include:

Most frequent (classification): Always predict the majority class. On imbalanced datasets, this can achieve deceptively high accuracy.
Stratified (classification): Predict classes according to their frequency distribution in the training data.
Mean (regression): Always predict the arithmetic mean of the observed target values.
Median (regression): Always predict the median of the observed target values.
Constant: Always predict a user-specified constant value.

In the online learning setting, these statistics are maintained incrementally: the running mean, running mode, or running class frequencies are updated with each new observation, and predictions reflect the most recent statistics.
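As an illustration of incremental maintenance, the stratified strategy can be sketched in plain Python. This is a minimal sketch, not River's implementation: the class name `StratifiedDummyClassifier`, the `seed` parameter, and the choice to return `None` before any label has been seen are assumptions made for the example.

```python
import random
from collections import Counter


class StratifiedDummyClassifier:
    """Hypothetical sketch: samples a label in proportion to running class counts."""

    def __init__(self, seed=None):
        self.class_counts = Counter()    # running class frequencies
        self.rng = random.Random(seed)   # private RNG so runs are reproducible

    def learn_one(self, x, y):
        # Features are ignored; only the label updates the statistics.
        self.class_counts[y] += 1
        return self

    def predict_proba_one(self, x):
        total = sum(self.class_counts.values())
        if total == 0:
            return {}
        return {label: n / total for label, n in self.class_counts.items()}

    def predict_one(self, x):
        # Assumption: return None until at least one label has been observed.
        if not self.class_counts:
            return None
        labels = list(self.class_counts)
        weights = list(self.class_counts.values())
        return self.rng.choices(labels, weights=weights, k=1)[0]


model = StratifiedDummyClassifier(seed=42)
for label in ["spam", "ham", "ham", "ham"]:
    model.learn_one({}, label)

probabilities = model.predict_proba_one({})
```

After the four updates, `probabilities` reflects the observed frequencies (one quarter `"spam"`, three quarters `"ham"`), and each call to `predict_one` draws a label from that distribution.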

Usage

Use dummy baseline estimators when:

You need a lower bound on acceptable model performance for a given dataset.
You want to verify that your evaluation pipeline is functioning correctly.
You are comparing multiple models and need a reference point.
You want to detect datasets where class imbalance makes simple accuracy misleading.

Theoretical Basis

Online Statistic Maintenance

Mean Dummy Regressor (online):
    Initialize: running_sum = 0, count = 0
    learn_one(x, y):
        running_sum += y
        count += 1
    predict_one(x):
        if count == 0: return 0    # no data yet; any fixed default works
        return running_sum / count

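Most Frequent Dummy Classifier (online):
    Initialize: class_counts = {}
    learn_one(x, y):
        class_counts[y] = class_counts.get(y, 0) + 1    # unseen classes start at 0
    predict_one(x):
        if class_counts is empty: return None    # no data yet
        return argmax over classes c of class_counts[c]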
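The pseudocode above can be turned into runnable Python roughly as follows. This is a standalone sketch rather than the River implementation: the class names, the cold-start defaults (`0.0` and `None`), and the helper `progressive_accuracy` are assumptions introduced for illustration. The helper scores a model in test-then-train order, which is the usual way to evaluate a streaming estimator.

```python
from collections import Counter


class MeanDummyRegressor:
    """Always predicts the running mean of the targets seen so far."""

    def __init__(self):
        self.running_sum = 0.0
        self.count = 0

    def learn_one(self, x, y):
        self.running_sum += y
        self.count += 1
        return self

    def predict_one(self, x):
        # Assumption: fall back to 0.0 before any target has been observed.
        return self.running_sum / self.count if self.count else 0.0


class MostFrequentDummyClassifier:
    """Always predicts the class with the highest running count."""

    def __init__(self):
        self.class_counts = Counter()

    def learn_one(self, x, y):
        self.class_counts[y] += 1
        return self

    def predict_one(self, x):
        # Assumption: return None before any label has been observed.
        if not self.class_counts:
            return None
        return self.class_counts.most_common(1)[0][0]


def progressive_accuracy(model, stream):
    """Test-then-train: predict each example before learning from it."""
    correct = total = 0
    for x, y in stream:
        correct += model.predict_one(x) == y
        model.learn_one(x, y)
        total += 1
    return correct / total if total else 0.0


regressor = MeanDummyRegressor()
for target in [1.0, 2.0, 6.0]:
    regressor.learn_one({}, target)

stream = [({}, "a"), ({}, "a"), ({}, "b"), ({}, "a")]
accuracy = progressive_accuracy(MostFrequentDummyClassifier(), stream)
```

Because the first prediction is made before any label has been seen, the progressive accuracy on a short stream is lower than the majority-class share; the gap shrinks as the stream grows.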

Theoretical Guarantees

The performance of dummy estimators provides important theoretical reference points:

Classification accuracy floor: For a dataset whose majority class has prior $p_{\max}$, the most-frequent dummy achieves accuracy = $p_{\max}$.
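Regression MSE ceiling: For a target with variance $\sigma^{2}$, the mean dummy achieves MSE = $\sigma^{2}$, the total variance of the target. This is the lowest MSE attainable by any constant prediction.
Regression MAE ceiling: The median dummy minimizes the expected absolute error among all constant predictions.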

Any model that performs worse than these baselines adds no value over the trivial rule; the dummy estimator itself would be the better choice.
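These reference points can be checked numerically on a small sample. The sketch below uses only the Python standard library; the data values and the helper names `mse` and `mae` are made up for illustration. The statistics are computed over the full sample, so the numbers describe the limiting behaviour of the online estimators rather than their test-then-train scores on a short stream.

```python
from math import isclose
from statistics import mean, median, pvariance

# Classification: a 95/5 imbalanced label set (illustrative data).
labels = ["neg"] * 95 + ["pos"] * 5
majority = max(set(labels), key=labels.count)
majority_accuracy = sum(label == majority for label in labels) / len(labels)

# Regression: a small target sample with one large value (illustrative data).
targets = [1.0, 2.0, 3.0, 4.0, 10.0]


def mse(constant):
    """Mean squared error of always predicting `constant`."""
    return mean((y - constant) ** 2 for y in targets)


def mae(constant):
    """Mean absolute error of always predicting `constant`."""
    return mean(abs(y - constant) for y in targets)


mean_mse = mse(mean(targets))        # equals the population variance of targets
median_mae = mae(median(targets))    # lowest MAE of any constant prediction
mean_mae = mae(mean(targets))        # higher than median_mae on this sample

assert isclose(majority_accuracy, 0.95)
assert isclose(mean_mse, pvariance(targets))
assert median_mae < mean_mae
```

On this sample the mean is pulled upward by the value 10.0, which is why the median gives the smaller absolute error while the mean gives the smaller squared error.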

Role in Model Evaluation

The skill score framework normalizes model performance against the dummy baseline:

Skill = (Score_model - Score_dummy) / (Score_perfect - Score_dummy)

A skill of 0 means no improvement over the dummy; a skill of 1 means perfect performance. Negative skill means the model is worse than the baseline.
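The skill formula translates directly into a small helper. This is a sketch under stated assumptions: the function name `skill_score`, the error raised for a zero denominator, and the example scores are all hypothetical. The same expression works for metrics where lower is better, such as MSE, provided the perfect score is set to 0.

```python
from math import isclose


def skill_score(score_model, score_dummy, score_perfect):
    """Normalize a model score against the dummy baseline and the perfect score."""
    denominator = score_perfect - score_dummy
    if denominator == 0:
        # Assumption: treat a baseline that is already perfect as an error.
        raise ValueError("dummy baseline already achieves the perfect score")
    return (score_model - score_dummy) / denominator


# Accuracy (higher is better): model 0.97, majority-class dummy 0.95.
accuracy_skill = skill_score(0.97, 0.95, 1.0)

# MSE (lower is better): model 2.5, mean dummy 10.0, perfect score 0.
mse_skill = skill_score(2.5, 10.0, 0.0)

# A model that is worse than the dummy gets a negative skill.
negative_skill = skill_score(0.90, 0.95, 1.0)

assert isclose(accuracy_skill, 0.4)
assert isclose(mse_skill, 0.75)
assert negative_skill < 0
```

The accuracy example shows why the normalization matters on imbalanced data: a raw accuracy of 0.97 looks strong, but it closes only about 40% of the gap between the baseline and a perfect classifier.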

Related Pages

Implementation:Online_ml_River_Dummy_Estimators
