Principle:Online ml River Anomaly Detection Evaluation

Knowledge Sources	River; River Docs; Beating the Hold-Out: Bounds for K-fold and Progressive Cross-Validation
Domains	Online Machine Learning, Anomaly Detection, Model Evaluation, Progressive Validation
Last Updated	2026-02-08 16:00 GMT

Overview

An evaluation methodology for online anomaly detectors that combines the ROCAUC metric with the progressive validation protocol, giving a faithful assessment of how a detector would perform in a production streaming scenario.

Description

Evaluating an anomaly detector in the streaming setting requires a different approach from batch evaluation. In batch learning, a model is trained on a training set and evaluated on a held-out test set. In online learning, the model processes observations one at a time, so there is no fixed split and evaluation must interleave prediction with learning.

The progressive validation protocol provides the canonical evaluation methodology for online models. For each observation in the stream:

1. Predict (score) the observation before the model has seen it.
2. Record the prediction against the ground truth label.
3. Update the model with the observation.

This ensures that every prediction is made on genuinely unseen data, providing a realistic estimate of production performance.

For anomaly detectors specifically, progressive validation differs from classification evaluation in two ways:

Instead of predict_one or predict_proba_one, the evaluation function calls score_one to obtain the anomaly score.
The ROCAUC (Receiver Operating Characteristic Area Under Curve) metric is used as the primary metric, as it evaluates the quality of the anomaly ranking regardless of any specific threshold choice.

ROCAUC is particularly well-suited for anomaly detection because:

Anomaly detection problems are typically highly imbalanced (e.g., 0.17% anomalies in CreditCard).
The optimal threshold is often unknown; ROCAUC evaluates the detector across all possible thresholds.
It directly measures the detector's ability to rank anomalies higher than normal observations.
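The imbalance and threshold arguments above can be made concrete with a little arithmetic. The sketch below is plain Python with made-up numbers (the counts are chosen only to match the 0.17% rate quoted above and are not taken from the CreditCard dataset; `precision_recall` is a hypothetical helper). It shows that a detector which never flags anything still reaches very high accuracy, and that precision and recall computed from the same scores change with the threshold.

```python
# Illustrative counts only: a stream with 0.17% anomalies.
n_total = 100_000
n_anomalies = 170

# Labelling everything as normal is correct on every normal observation,
# so accuracy looks excellent even though no anomaly is ever found.
accuracy_always_normal = (n_total - n_anomalies) / n_total
recall_always_normal = 0 / n_anomalies

# (score, is_anomaly) pairs from an imaginary detector.
scored = [
    (0.95, True), (0.80, False), (0.70, True),
    (0.40, False), (0.35, True), (0.10, False),
]


def precision_recall(scored, threshold):
    """Precision and recall when scores >= threshold are flagged as anomalies."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


strict = precision_recall(scored, threshold=0.9)   # flags one observation
lenient = precision_recall(scored, threshold=0.3)  # flags five observations
```

Here `accuracy_always_normal` is 0.9983 with zero recall, while the strict and lenient thresholds trade precision against recall on identical scores. A ranking metric sidesteps both problems because it never commits to a single cut-off.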

Usage

Use anomaly detection evaluation when:

You need to assess the performance of an anomaly detector on a labeled data stream
You want realistic performance estimates that mimic production conditions
You are comparing different anomaly detectors or configurations
You want to track performance progression over time (via print_every)

Theoretical Basis

Progressive validation protocol for anomaly detection:

EVALUATE(dataset, model, metric):
    for each (x, y) in dataset:
        # 1. Score before learning
        score = model.score_one(x)

        # 2. Update metric with ground truth
        metric.update(y_true=y, y_pred=score)

        # 3. Learn from observation (unsupervised)
        model.learn_one(x)

    return metric
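
The protocol above can be sketched as runnable Python. This is a minimal illustration that does not use River: `RunningMeanDetector` and `PairwiseAUC` are hypothetical stand-ins that only mimic the `score_one` / `learn_one` and `update` / `get` method names used in the pseudocode, and the stream is synthetic.

```python
import random


class RunningMeanDetector:
    """Toy detector (not part of River): scores a point by its distance from
    the running mean of feature "v", squashed into [0, 1)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def score_one(self, x):
        if self.n == 0:
            return 0.0  # nothing learned yet, so nothing looks unusual
        d = abs(x["v"] - self.mean)
        return d / (1.0 + d)

    def learn_one(self, x):
        self.n += 1
        self.mean += (x["v"] - self.mean) / self.n


class PairwiseAUC:
    """Exact AUC that stores every score; acceptable for a small demo only."""

    def __init__(self):
        self.pos, self.neg = [], []

    def update(self, y_true, y_pred):
        (self.pos if y_true else self.neg).append(y_pred)

    def get(self):
        wins = sum((p > n) + 0.5 * (p == n) for p in self.pos for n in self.neg)
        return wins / (len(self.pos) * len(self.neg))


def evaluate(dataset, model, metric):
    for x, y in dataset:
        score = model.score_one(x)             # 1. score before learning
        metric.update(y_true=y, y_pred=score)  # 2. compare with ground truth
        model.learn_one(x)                     # 3. learn without the label
    return metric


# Synthetic stream: every 50th observation is an anomaly far from the rest.
rng = random.Random(0)
stream = []
for i in range(500):
    is_anomaly = i % 50 == 49
    value = rng.gauss(8.0, 1.0) if is_anomaly else rng.gauss(0.0, 1.0)
    stream.append(({"v": value}, is_anomaly))

auc = evaluate(stream, RunningMeanDetector(), PairwiseAUC()).get()
```

The label `y` is passed only to the metric, never to `learn_one`, which is what keeps the detector unsupervised during evaluation.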

Detecting anomaly detectors:

The evaluation function automatically detects whether a model is an anomaly detector using utils.inspect.isanomalydetector(model) or utils.inspect.isanomalyfilter(model). When detected, it uses model.score_one instead of model.predict_one or model.predict_proba_one.

if isanomalydetector(model) or isanomalyfilter(model):
    pred_func = model.score_one
elif isclassifier(model) and not metric.requires_labels:
    pred_func = model.predict_proba_one
else:
    pred_func = model.predict_one
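
The effect of this dispatch can be tried out without River. The sketch below is an assumption-laden simplification: it selects the method by duck typing (`hasattr`) rather than by the inspect helpers shown above, and `choose_pred_func`, `ToyDetector`, and `ToyClassifier` are hypothetical names introduced only for illustration.

```python
def choose_pred_func(model, metric_requires_labels):
    """Pick the prediction method by duck typing (an assumption of this
    sketch; the excerpt above relies on type-inspection helpers instead)."""
    if hasattr(model, "score_one"):
        return model.score_one
    if hasattr(model, "predict_proba_one") and not metric_requires_labels:
        return model.predict_proba_one
    return model.predict_one


class ToyDetector:
    def score_one(self, x):
        return 0.9  # constant anomaly score, for illustration


class ToyClassifier:
    def predict_one(self, x):
        return True

    def predict_proba_one(self, x):
        return {True: 0.7, False: 0.3}


x = {"v": 1.0}
detector_out = choose_pred_func(ToyDetector(), metric_requires_labels=False)(x)
proba_out = choose_pred_func(ToyClassifier(), metric_requires_labels=False)(x)
label_out = choose_pred_func(ToyClassifier(), metric_requires_labels=True)(x)
```

A detector always yields a raw score, while a classifier yields probabilities only when the metric can consume them and hard labels otherwise.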

ROCAUC metric:

ROCAUC measures the probability that a randomly chosen anomalous observation receives a higher score than a randomly chosen normal observation:

ROCAUC = P(score(anomaly) > score(normal))

ROCAUC = 1.0: Perfect separation -- all anomalies scored higher than all normals.
ROCAUC = 0.5: Random performance -- no better than chance.
ROCAUC < 0.5: Worse than random (scores are inverted).
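
The probabilistic definition can be checked directly on small score lists by counting correctly ordered (anomaly, normal) pairs. The sketch below is a brute-force illustration, not a streaming implementation; `pairwise_auc` is a hypothetical helper, and counting ties as half a win is the usual convention rather than something stated above.

```python
def pairwise_auc(anomaly_scores, normal_scores):
    """Fraction of (anomaly, normal) pairs ranked correctly; ties count half."""
    total = 0.0
    for a in anomaly_scores:
        for n in normal_scores:
            if a > n:
                total += 1.0
            elif a == n:
                total += 0.5
    return total / (len(anomaly_scores) * len(normal_scores))


perfect = pairwise_auc([0.9, 0.8], [0.3, 0.2, 0.1])   # every anomaly on top
constant = pairwise_auc([0.5, 0.5], [0.5, 0.5, 0.5])  # uninformative scores
inverted = pairwise_auc([0.1, 0.2], [0.7, 0.8, 0.9])  # ranking reversed
mixed = pairwise_auc([0.9, 0.4], [0.6, 0.3, 0.2])     # one pair out of six is wrong
```

The four cases give 1.0, 0.5, 0.0, and 5/6 respectively. An inverted detector is still informative: negating its scores would turn 0.0 into 1.0.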

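In River, ROCAUC is computed incrementally. Rather than storing every score and label, the metric maintains an approximation of the ROC curve over a fixed number of thresholds (the n_thresholds parameter) and updates it one observation at a time, consistent with the streaming paradigm. The result is therefore an approximation of the exact ROCAUC, though it remains suitable for comparing detectors against each other.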
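
One way to approximate ROC AUC in fixed memory is to keep confusion counts for a handful of thresholds and integrate the resulting curve with the trapezoidal rule. The sketch below illustrates that idea under the assumption that scores lie in [0, 1]; `BinnedROCAUC` is a hypothetical class written for this page and should not be read as River's actual implementation.

```python
class BinnedROCAUC:
    """Fixed-memory ROC AUC sketch: one pair of counters per threshold.
    Assumes scores lie in [0, 1]."""

    def __init__(self, n_thresholds=10):
        # Evenly spaced cut-offs; the end points are nudged outwards so the
        # curve always contains the (1, 1) and (0, 0) corners.
        self.thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
        self.thresholds[0] -= 1e-7
        self.thresholds[-1] += 1e-7
        self.tp = [0] * n_thresholds
        self.fp = [0] * n_thresholds
        self.n_pos = 0
        self.n_neg = 0

    def update(self, y_true, y_pred):
        if y_true:
            self.n_pos += 1
        else:
            self.n_neg += 1
        for i, t in enumerate(self.thresholds):
            if y_pred > t:  # observation would be flagged at this cut-off
                if y_true:
                    self.tp[i] += 1
                else:
                    self.fp[i] += 1

    def get(self):
        if self.n_pos == 0 or self.n_neg == 0:
            return None  # undefined until both classes have been seen
        tpr = [tp / self.n_pos for tp in self.tp]
        fpr = [fp / self.n_neg for fp in self.fp]
        # Thresholds ascend, so FPR descends; apply the trapezoidal rule.
        return sum(
            (fpr[i] - fpr[i + 1]) * (tpr[i] + tpr[i + 1]) / 2
            for i in range(len(fpr) - 1)
        )


separated = BinnedROCAUC()
for score, label in [(0.95, True), (0.85, True), (0.05, False), (0.15, False), (0.25, False)]:
    separated.update(y_true=label, y_pred=score)
well_separated_auc = separated.get()

same_bin = BinnedROCAUC()
same_bin.update(y_true=True, y_pred=0.50)
same_bin.update(y_true=False, y_pred=0.45)
same_bin_auc = same_bin.get()
```

The first stream gives 1.0, matching the exact value. The second gives 0.5 even though the anomaly is ranked above the normal point, because both scores fall between the same two thresholds. This is the price of constant memory, and it is why scores spread across the [0, 1] range are approximated more faithfully than scores clustered in a narrow band.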

Delayed progressive validation:

For more realistic evaluation, the delay parameter can simulate the scenario where ground truth labels arrive after a delay (e.g., fraud is confirmed days after the transaction). Because the model is updated only when an observation's label arrives, each score is produced by a model that has not yet learned from the most recent observations. This stress-tests the detector under the same staleness it would face in production.

EVALUATE_WITH_DELAY(dataset, model, metric, delay):
    pending = {}

    for each (i, x, y) in simulate_qa(dataset, delay):
        if y is None:
            # Question: features of observation i, label not yet available
            pending[i] = model.score_one(x)
        else:
            # Answer: label y for the earlier observation i
            score = pending.pop(i)
            metric.update(y_true=y, y_pred=score)
            model.learn_one(x)

    return metric
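
The delayed protocol can be sketched in plain Python as follows. The generator `simulate_qa_fixed_delay` is a hypothetical simplification in which the delay is a count of observations rather than a duration, and `CountingModel` and `Recorder` are toy stand-ins chosen so that the staleness of each score is directly visible.

```python
from collections import deque


def simulate_qa_fixed_delay(dataset, delay):
    """Hypothetical question/answer stream: yields (i, x, None) when
    observation i arrives and (i, x, y) after `delay` further observations
    have been scored (or when the stream ends)."""
    waiting = deque()  # observations whose labels are not yet revealed
    for i, (x, y) in enumerate(dataset):
        while waiting and waiting[0][0] + delay < i:
            yield waiting.popleft()  # answer for an earlier observation
        yield i, x, None             # question for the current observation
        waiting.append((i, x, y))
    yield from waiting               # remaining labels arrive at the end


def evaluate_with_delay(dataset, model, metric, delay):
    pending = {}
    for i, x, y in simulate_qa_fixed_delay(dataset, delay):
        if y is None:
            pending[i] = model.score_one(x)
        else:
            metric.update(y_true=y, y_pred=pending.pop(i))
            model.learn_one(x)
    return metric


class CountingModel:
    """Toy model whose score is the number of observations learned so far."""

    def __init__(self):
        self.n_learned = 0

    def score_one(self, x):
        return self.n_learned

    def learn_one(self, x):
        self.n_learned += 1


class Recorder:
    """Toy metric that simply records (label, score) pairs."""

    def __init__(self):
        self.seen = []

    def update(self, y_true, y_pred):
        self.seen.append((y_true, y_pred))


data = [({"v": v}, v > 5) for v in [1, 2, 9, 3, 4]]
no_delay = evaluate_with_delay(data, CountingModel(), Recorder(), delay=0).seen
delayed = evaluate_with_delay(data, CountingModel(), Recorder(), delay=2).seen
```

With `delay=0` the third observation is scored by a model that has learned two observations; with `delay=2` it is scored by a model that has learned none, because the earlier labels have not arrived yet.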

AnomalyFilter evaluation:

When the model is an AnomalyFilter (ThresholdFilter or QuantileFilter), the evaluation function additionally calls model.classify(score) to convert the score to a binary prediction. This enables evaluation with classification metrics like precision, recall, and F1.
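
The score-then-classify step can be illustrated without River. In the sketch below, `ToyThresholdFilter` and `RunningMeanDetector` are hypothetical classes that only borrow the `score_one`, `classify`, and `learn_one` method names from the text; the real filters may behave differently (for example in how they learn from observations they flag). The stream is a hand-written list of values.

```python
class RunningMeanDetector:
    """Toy detector: distance from the running mean, squashed into [0, 1)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def score_one(self, x):
        if self.n == 0:
            return 0.0
        d = abs(x["v"] - self.mean)
        return d / (1.0 + d)

    def learn_one(self, x):
        self.n += 1
        self.mean += (x["v"] - self.mean) / self.n


class ToyThresholdFilter:
    """Hypothetical wrapper that turns a score into a binary decision."""

    def __init__(self, detector, threshold):
        self.detector = detector
        self.threshold = threshold

    def score_one(self, x):
        return self.detector.score_one(x)

    def classify(self, score):
        return score >= self.threshold

    def learn_one(self, x):
        self.detector.learn_one(x)


def precision_recall(values, threshold):
    """Progressive validation with hard labels; anomalies are values > 5."""
    model = ToyThresholdFilter(RunningMeanDetector(), threshold)
    tp = fp = fn = 0
    for v in values:
        x, y = {"v": v}, v > 5
        flagged = model.classify(model.score_one(x))  # score, then classify
        tp += flagged and y
        fp += flagged and not y
        fn += (not flagged) and y
        model.learn_one(x)
    return tp / (tp + fp), tp / (tp + fn)


values = [0.1, -0.2, 0.0, 6.0, 0.3, -0.1, 7.0, 0.2, 0.1, -0.3]
high = precision_recall(values, threshold=0.8)
low = precision_recall(values, threshold=0.5)
```

Both thresholds catch the two anomalies, but the lower one also flags the five normal points that follow the first anomaly, since the running mean has been pulled upwards by it. Precision drops from 1.0 to 2/7 while recall stays at 1.0, which is the kind of difference that classification metrics expose and a threshold-free metric does not.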
