Workflow:Online ml River Streaming Anomaly Detection
| Knowledge Sources | |
|---|---|
| Domains | Online_ML, Anomaly_Detection, Streaming_Data |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
End-to-end process for detecting anomalies in streaming data using online anomaly detectors with configurable thresholding and filtering.
Description
This workflow covers the procedure for identifying anomalous observations in a continuous data stream. It leverages River's anomaly detection module, which provides unsupervised models like Half-Space Trees (an online variant of Isolation Forests) and One-Class SVM. The process includes feature normalization, anomaly scoring, threshold calibration using quantile filters, and evaluation against labeled ground truth when available. Anomaly detectors produce scores (higher means more anomalous) without requiring labeled training data.
Usage
Execute this workflow when you need to identify unusual patterns or outliers in a continuous data stream, such as fraud detection in financial transactions, network intrusion detection, or sensor fault detection. This workflow is appropriate when labeled anomaly data is scarce or unavailable, and the model must adapt to evolving normal behavior.
Execution Steps
Step 1: Load or Connect to an Unlabeled Data Stream
Obtain a stream of observations for anomaly detection. River provides built-in anomaly datasets (HTTP, SMTP, CreditCard) where anomalies are labeled for evaluation purposes. For production use, the stream consists of feature dictionaries without labels. The anomaly detector learns what is normal from the stream itself.
Key considerations:
- Anomaly datasets are highly imbalanced (very few anomalies)
- Each observation is a Python dict of numeric features
- Labels, if available, are only used for evaluation, not training
- Built-in datasets: datasets.CreditCard(), datasets.HTTP(), datasets.SMTP()
Step 2: Normalize Features
Apply feature scaling to ensure all input features are on comparable scales. This is critical for distance-based and tree-based anomaly detectors. MinMaxScaler normalizes features to the [0,1] range, which is the expected input range for HalfSpaceTrees. StandardScaler is an alternative for SVM-based detectors.
Key considerations:
- HalfSpaceTrees expects features in the [0, 1] range; use MinMaxScaler
- Scalers learn the feature distributions incrementally
- Chain the scaler and detector using the pipe operator
- Feature normalization happens transparently within the pipeline
Step 3: Configure the Anomaly Detector
Select and configure an anomaly detection model. HalfSpaceTrees is the primary choice, building an ensemble of random space-partitioning trees that assign anomaly scores based on isolation depth. OneClassSVM provides an alternative using a stochastic online SVM. Configure key parameters like number of trees, tree height, and window size.
Key considerations:
- HalfSpaceTrees parameters: n_trees (ensemble size), height (tree depth), window_size (reference vs working window)
- Larger window_size provides more stable reference distributions
- OneClassSVM uses a nu parameter controlling the expected anomaly fraction
- LocalOutlierFactor provides density-based detection with a configurable neighborhood size
Step 4: Apply Anomaly Filters for Classification
Wrap the anomaly detector with a filter that converts continuous scores into binary anomaly decisions. QuantileFilter flags observations whose anomaly score exceeds a given quantile threshold. ThresholdFilter uses a fixed score threshold. ConstantFilter uses a constant threshold value.
Key considerations:
- QuantileFilter adaptively sets the threshold based on score distribution
- A quantile of 0.95 flags the top 5% most anomalous observations
- ThresholdFilter provides a fixed cutoff for deterministic thresholding
- Filters wrap the detector and expose a classify() method
Step 5: Evaluate Detection Performance
When ground truth labels are available, evaluate the detector using streaming metrics. ROCAUC measures ranking quality of raw anomaly scores. For filtered binary predictions, use standard classification metrics. The progressive validation pattern applies: score first, evaluate, then update the model.
Key considerations:
- Use ROCAUC for evaluating raw anomaly scores (without filtering)
- For binary classification after filtering, use Accuracy, F1, Precision, Recall
- The evaluation loop uses score_one(x) for raw scores or classify for filtered labels
- progressive_val_score works with anomaly detectors and filters
Step 6: Deploy for Continuous Monitoring
In production, the anomaly detector processes each new observation, returning an anomaly score. The model continuously adapts to evolving normal patterns. Set up alerting based on the filter output or raw score thresholds. The model requires no batch retraining and handles concept drift naturally through its windowed reference mechanism.
Key considerations:
- HalfSpaceTrees uses a reference/working window swap for adaptation
- The model adapts to gradual distribution shifts automatically
- Memory usage is bounded by the window_size and number of trees
- No retraining or batch processing is needed