Principle:Online ml River Anomaly Detection Base Interface
| Knowledge Sources | Machine Learning Anomaly Detection: A Survey |
|---|---|
| Domains | Online_Learning Anomaly_Detection Software_Design |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
The anomaly detection scoring interface defines a standardized contract for anomaly detectors that assign numerical anomaly scores to incoming observations in a streaming context. This interface abstracts the detection mechanism, enabling uniform composition, evaluation, and thresholding across diverse anomaly detection algorithms.
Description
Anomaly detection (also called outlier detection) identifies observations that deviate significantly from an expected pattern. In the online setting, detectors must process instances one at a time and produce a score indicating how anomalous each instance appears.
A well-designed base interface for anomaly scoring provides:
- score_one(x): Compute an anomaly score for a single observation. Higher scores indicate greater anomalousness.
- learn_one(x): Update the internal model with a new observation.
- Composability: Scores can be fed into thresholding or quantile-based filters to produce binary anomaly labels.
The separation of scoring from classification (normal vs. anomalous) is a deliberate design choice. Raw scores preserve information about the degree of anomalousness and allow downstream components to apply context-dependent thresholds.
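The contract described above can be sketched in a few lines of Python. This is a minimal illustration, not the library's actual base class: the names `AnomalyScorer` and `MeanDistanceScorer` are hypothetical, and observations are assumed to be feature dictionaries with a single `"value"` key.

```python
from abc import ABC, abstractmethod


class AnomalyScorer(ABC):
    """Hypothetical base contract: learn from and score one observation at a time."""

    @abstractmethod
    def learn_one(self, x: dict) -> None:
        """Update the internal model with a single observation."""

    @abstractmethod
    def score_one(self, x: dict) -> float:
        """Return an anomaly score; higher means more anomalous."""


class MeanDistanceScorer(AnomalyScorer):
    """Toy detector: the score is the distance of x["value"] from the running mean."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0

    def learn_one(self, x: dict) -> None:
        # Incremental mean update; no past observations are stored.
        self.n += 1
        self.mean += (x["value"] - self.mean) / self.n

    def score_one(self, x: dict) -> float:
        if self.n == 0:
            return 0.0  # nothing learned yet, so nothing can look anomalous
        return abs(x["value"] - self.mean)


detector = MeanDistanceScorer()
stream = [{"value": v} for v in (10.0, 10.0, 10.0, 10.0, 50.0)]
scores = []
for x in stream:
    scores.append(detector.score_one(x))  # score before learning from x
    detector.learn_one(x)
```

Scoring each observation before learning from it keeps the score from being influenced by the very instance under evaluation. No labels are produced here; turning `scores` into decisions is left to a downstream component.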
Usage
Use an anomaly detection scoring interface when:
- You need a uniform API across multiple anomaly detection algorithms.
- You want to decouple scoring from thresholding decisions.
- You need to compose anomaly detectors with other pipeline components.
- You want to evaluate and compare detectors using standard streaming metrics.
Theoretical Basis
Anomaly scoring maps each observation to a real-valued score:
score: X -> R
Higher values indicate greater deviation from the learned normal pattern. The scoring function is algorithm-dependent:
- Distance-based: Score proportional to distance from cluster centers or nearest neighbors.
- Density-based: Score inversely proportional to local density estimate.
- Probabilistic: Score derived from the negative log-likelihood under a fitted distribution: score(x) = -log p(x), where p is the fitted density.
- Reconstruction-based: Score based on reconstruction error from a learned representation.
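As one concrete instance of the families listed above, the probabilistic case can be sketched with a univariate Gaussian fitted online. This is an illustrative assumption rather than any particular library's detector: the class name `GaussianNLLScorer` is hypothetical, the input is a single float, and the mean and variance are tracked with Welford's algorithm.

```python
import math


class GaussianNLLScorer:
    """Hypothetical scorer: negative log-likelihood under a running Gaussian fit."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's algorithm)

    def learn_one(self, value: float) -> None:
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def score_one(self, value: float) -> float:
        if self.n < 2:
            return 0.0  # variance is undefined with fewer than two samples
        var = max(self.m2 / (self.n - 1), 1e-12)  # floor avoids division by zero
        # -log N(value | mean, var)
        return 0.5 * math.log(2 * math.pi * var) + (value - self.mean) ** 2 / (2 * var)


scorer = GaussianNLLScorer()
for v in (1.0, 2.0, 3.0, 4.0, 5.0):
    scorer.learn_one(v)

typical = scorer.score_one(3.0)   # at the fitted mean
unusual = scorer.score_one(10.0)  # far in the tail, so the score is larger
```

The same `learn_one` / `score_one` shape would apply to the distance-, density-, and reconstruction-based families; only the body of `score_one` changes.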
Normalization: Anomaly scores from different detectors may have different ranges. A common approach is to normalize scores to [0, 1] using the cumulative distribution function of observed scores, enabling cross-algorithm comparison.
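A minimal sketch of this normalization, assuming the empirical CDF of all previously observed scores is used. The class `EmpiricalCDFNormalizer` is hypothetical, and it stores every score, so a real streaming setting would likely need a bounded window or a quantile sketch instead.

```python
import bisect


class EmpiricalCDFNormalizer:
    """Hypothetical helper: maps a raw score to the fraction of past scores <= it."""

    def __init__(self) -> None:
        self._sorted_scores = []

    def update(self, score: float) -> None:
        # Keep past scores sorted so the rank lookup is a binary search.
        bisect.insort(self._sorted_scores, score)

    def normalize(self, score: float) -> float:
        if not self._sorted_scores:
            return 0.5  # arbitrary neutral value before any score is seen
        rank = bisect.bisect_right(self._sorted_scores, score)
        return rank / len(self._sorted_scores)


normalizer = EmpiricalCDFNormalizer()
for raw in (1.0, 2.0, 3.0, 4.0):
    normalizer.update(raw)

mid_rank = normalizer.normalize(2.5)    # two of four past scores are <= 2.5
top_rank = normalizer.normalize(10.0)   # larger than every past score
```

Because the output is a rank fraction, two detectors with very different raw scales become comparable: a normalized value near 1 means "higher than almost everything this detector has produced so far".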
Threshold selection: Given scores, a binary decision requires a threshold tau:
label(x) = "anomaly"  if score(x) > tau
           "normal"   otherwise
The threshold may be fixed, adaptive (e.g., based on a running quantile), or derived from a desired false positive rate.
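The decision rule and an adaptive threshold can be sketched as follows. This is an illustration under stated assumptions: `label` and `RunningQuantileThreshold` are hypothetical names, and the running quantile is computed by sorting a fixed-size window of recent scores, which is simple but not the most efficient option.

```python
from collections import deque


def label(score: float, tau: float) -> str:
    """Apply the decision rule: anomaly only if the score strictly exceeds tau."""
    return "anomaly" if score > tau else "normal"


class RunningQuantileThreshold:
    """Hypothetical adaptive threshold: the q-quantile of the last `window` scores."""

    def __init__(self, q: float = 0.95, window: int = 100) -> None:
        self.q = q
        self.window = deque(maxlen=window)  # oldest scores are dropped automatically

    def update(self, score: float) -> None:
        self.window.append(score)

    def tau(self) -> float:
        if not self.window:
            return float("inf")  # with no history, nothing is flagged
        ordered = sorted(self.window)
        idx = min(int(self.q * len(ordered)), len(ordered) - 1)
        return ordered[idx]


threshold = RunningQuantileThreshold(q=0.5, window=10)
for s in range(1, 11):  # scores 1..10
    threshold.update(float(s))

current_tau = threshold.tau()
decisions = [label(7.0, current_tau), label(6.0, current_tau)]
```

A fixed threshold is the special case where `tau` is a constant. Targeting a false positive rate of roughly `alpha` on normal data corresponds to choosing `q = 1 - alpha`, under the assumption that the window mostly contains scores of normal observations.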