Principle:Online ml River Anomaly Detection Evaluation
| Knowledge Sources | River; River Docs; Beating the Hold-Out: Bounds for K-fold and Progressive Cross-Validation |
|---|---|
| Domains | Online Machine Learning, Anomaly Detection, Model Evaluation, Progressive Validation |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
An evaluation methodology for online anomaly detectors that combines the ROCAUC metric with the progressive validation protocol, giving a faithful assessment of how an anomaly detector would perform in a production streaming scenario.
Description
Anomaly Detection Evaluation in the streaming setting requires a fundamentally different approach from batch evaluation. In batch learning, a model is trained on a training set and evaluated on a held-out test set. In online learning, the model processes observations one at a time, and evaluation must interleave prediction with learning.
The progressive validation protocol provides the canonical evaluation methodology for online models. For each observation in the stream:
- Predict (score) the observation before the model has seen it.
- Record the prediction against the ground truth label.
- Update the model with the observation.
This ensures that every prediction is made on genuinely unseen data, providing a realistic estimate of production performance.
For anomaly detectors specifically, progressive validation differs from classification evaluation in two ways:
- Instead of predict_one or predict_proba_one, the evaluation function calls score_one to obtain the anomaly score.
- The ROCAUC (Receiver Operating Characteristic Area Under Curve) metric is used as the primary metric, as it evaluates the quality of the anomaly ranking regardless of any specific threshold choice.
ROCAUC is particularly well-suited for anomaly detection because:
- Anomaly detection problems are typically highly imbalanced (e.g., 0.17% anomalies in CreditCard).
- The optimal threshold is often unknown; ROCAUC evaluates the detector across all possible thresholds.
- It directly measures the detector's ability to rank anomalies higher than normal observations.
Usage
Use anomaly detection evaluation when:
- You need to assess the performance of an anomaly detector on a labeled data stream
- You want realistic performance estimates that mimic production conditions
- You are comparing different anomaly detectors or configurations
- You want to track performance progression over time (via print_every)
Theoretical Basis
Progressive validation protocol for anomaly detection:
EVALUATE(dataset, model, metric):
    for each (x, y) in dataset:
        # 1. Score before learning
        score = model.score_one(x)
        # 2. Update metric with ground truth
        metric.update(y_true=y, y_pred=score)
        # 3. Learn from observation (unsupervised)
        model.learn_one(x)
    return metric
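The protocol above can be sketched in plain Python. This is a minimal illustration, not River's implementation: RunningZScore, PairwiseAUC, and make_stream are hypothetical stand-ins that only mirror the score_one, learn_one, and metric.update method names used in the pseudocode, and the synthetic stream (about 2% anomalies drawn from a shifted Gaussian) is an assumption made for the example.

```python
import random


class RunningZScore:
    """Toy one-feature detector (hypothetical, not a River class)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    def score_one(self, x):
        if self.n < 2:
            return 0.0  # too little data to estimate a spread yet
        std = (self.m2 / (self.n - 1)) ** 0.5
        return abs(x["value"] - self.mean) / std if std > 0 else 0.0

    def learn_one(self, x):
        value = x["value"]
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)


class PairwiseAUC:
    """Exact AUC over stored scores (hypothetical stand-in for a streaming metric)."""

    def __init__(self):
        self.pos, self.neg = [], []

    def update(self, y_true, y_pred):
        (self.pos if y_true else self.neg).append(y_pred)

    def get(self):
        if not self.pos or not self.neg:
            return 0.5  # undefined without both classes; report chance level
        wins = sum((p > n) + 0.5 * (p == n) for p in self.pos for n in self.neg)
        return wins / (len(self.pos) * len(self.neg))


def evaluate(dataset, model, metric):
    for x, y in dataset:
        score = model.score_one(x)             # 1. score before learning
        metric.update(y_true=y, y_pred=score)  # 2. record against the label
        model.learn_one(x)                     # 3. learn (label is not used)
    return metric


def make_stream(n=2000, anomaly_rate=0.02, seed=7):
    """Synthetic labeled stream: normals ~ N(0, 1), anomalies ~ N(8, 1)."""
    rng = random.Random(seed)
    for _ in range(n):
        is_anomaly = rng.random() < anomaly_rate
        value = rng.gauss(8.0, 1.0) if is_anomaly else rng.gauss(0.0, 1.0)
        yield {"value": value}, is_anomaly


auc = evaluate(make_stream(), RunningZScore(), PairwiseAUC()).get()
print(f"progressive AUC: {auc:.3f}")
```

Because each score is computed before the corresponding learn_one call, the reported AUC reflects only predictions made on observations the detector had not yet seen.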
Detecting anomaly detectors:
The evaluation function automatically detects whether a model is an anomaly detector using utils.inspect.isanomalydetector(model) or utils.inspect.isanomalyfilter(model). When detected, it uses model.score_one instead of model.predict_one or model.predict_proba_one.
if isanomalydetector(model) or isanomalyfilter(model):
    pred_func = model.score_one
elif isclassifier(model) and not metric.requires_labels:
    pred_func = model.predict_proba_one
else:
    pred_func = model.predict_one
ROCAUC metric:
ROCAUC measures the probability that a randomly chosen anomalous observation receives a higher score than a randomly chosen normal observation:
ROCAUC = P(score(anomaly) > score(normal))
- ROCAUC = 1.0: Perfect separation -- all anomalies scored higher than all normals.
- ROCAUC = 0.5: Random performance -- no better than chance.
- ROCAUC < 0.5: Worse than random (scores are inverted).
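In River, ROCAUC is computed incrementally: the metric is updated one observation at a time and approximates the curve over a fixed number of thresholds (the n_thresholds parameter) rather than storing every score, consistent with the streaming paradigm.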
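To make the probabilistic definition concrete, the sketch below computes the exact pairwise quantity in batch on a few hand-picked scores. It is for illustration only and is not how a streaming metric is computed; pairwise_auc and the sample scores are hypothetical, and counting ties as one half is a common convention assumed here.

```python
def pairwise_auc(scores, labels):
    """Fraction of (anomaly, normal) pairs in which the anomaly scores higher.

    Ties count as half a win (an assumed convention).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [0, 0, 0, 1, 0, 1]  # two anomalies among six observations

perfect = pairwise_auc([0.1, 0.2, 0.3, 0.9, 0.4, 0.8], labels)
inverted = pairwise_auc([0.9, 0.8, 0.7, 0.1, 0.6, 0.2], labels)
constant = pairwise_auc([0.5] * 6, labels)
mixed = pairwise_auc([0.1, 0.6, 0.3, 0.9, 0.4, 0.5], labels)
```

Here perfect is 1.0, inverted is 0.0, constant is 0.5, and mixed is 0.875, since 7 of the 8 anomaly/normal pairs are ordered correctly.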
Delayed progressive validation:
For more realistic evaluation, the delay parameter can simulate the scenario where ground truth labels arrive after a delay (e.g., fraud is confirmed days after the transaction). This further stress-tests the detector by evaluating predictions that were made before recent model updates.
EVALUATE_WITH_DELAY(dataset, model, metric, delay):
    pending = {}
    for each event (id, x, y) in simulate_qa(dataset, delay):
        if event is question (x, no label yet):
            # Score now; the label arrives later
            score = model.score_one(x)
            pending[id] = score
        elif event is answer (x and label y for earlier observation):
            # Evaluate the stored score, then learn from the observation
            score = pending.pop(id)
            metric.update(y_true=y, y_pred=score)
            model.learn_one(x)
    return metric
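The effect of the delay on the order of operations can be traced with a small stdlib-only sketch. It assumes the delay is counted in observations and that due labels are released before the next observation is scored; the exact event ordering produced by River's simulate_qa may differ. LoggingDetector, CountingMetric, and evaluate_with_delay are hypothetical names used only for this illustration.

```python
from collections import deque


class LoggingDetector:
    """Hypothetical detector that only records the order of calls."""

    def __init__(self):
        self.log = []

    def score_one(self, x):
        self.log.append(("score", x["id"]))
        return 0.0

    def learn_one(self, x):
        self.log.append(("learn", x["id"]))


class CountingMetric:
    """Hypothetical metric that counts how many labels it has received."""

    def __init__(self):
        self.n = 0

    def update(self, y_true, y_pred):
        self.n += 1


def evaluate_with_delay(dataset, model, metric, delay):
    pending = deque()  # (x, y, score) tuples whose label has not "arrived" yet

    def reveal():
        old_x, old_y, old_score = pending.popleft()
        metric.update(y_true=old_y, y_pred=old_score)  # judge the old score
        model.learn_one(old_x)                         # only now learn from it

    for x, y in dataset:
        # Answers: release labels whose delay has elapsed.
        while pending and len(pending) >= delay:
            reveal()
        # Question: score the new observation with the current model state.
        pending.append((x, y, model.score_one(x)))

    while pending:  # labels that arrive after the stream ends
        reveal()
    return metric


stream = [({"id": i}, False) for i in range(4)]
detector = LoggingDetector()
metric = evaluate_with_delay(stream, detector, CountingMetric(), delay=2)
print(detector.log)
```

With delay=2, observations 0 and 1 are both scored before the detector has learned anything, and each later observation is scored by a model that lags two observations behind the stream.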
AnomalyFilter evaluation:
When the model is an AnomalyFilter (ThresholdFilter or QuantileFilter), the evaluation function additionally calls model.classify(score) to convert the score to a binary prediction. This enables evaluation with classification metrics like precision, recall, and F1.
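The following sketch shows why thresholding makes classification metrics applicable. It does not use River: classify here is a hypothetical free function standing in for a filter's classify step, the rule "flag scores at or above the threshold" is an assumption, and the scores and labels are made up for the example.

```python
def classify(score, threshold):
    """Hypothetical stand-in for a filter's classify step."""
    return score >= threshold  # assumed rule: at or above threshold is anomalous


def precision_recall_f1(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)        # anomalies correctly flagged
    fp = sum((not t) and p for t, p in pairs)  # normals wrongly flagged
    fn = sum(t and (not p) for t, p in pairs)  # anomalies missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


scores = [0.10, 0.95, 0.40, 0.80, 0.30, 0.65]
labels = [False, True, False, False, False, True]

strict = precision_recall_f1(labels, [classify(s, 0.9) for s in scores])
lenient = precision_recall_f1(labels, [classify(s, 0.6) for s in scores])
```

With the threshold at 0.9 the result is precision 1.0 and recall 0.5; lowering it to 0.6 gives precision 2/3, recall 1.0, and F1 0.8. Unlike ROCAUC, these metrics depend on the chosen threshold, which is why they require the binary output of a filter rather than the raw score.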