Workflow:Online ml River Streaming Anomaly Detection

Knowledge Sources	River River Documentation River JMLR Paper
Domains	Online_ML, Anomaly_Detection, Streaming_Data
Last Updated	2026-02-08 16:00 GMT

Overview

End-to-end process for detecting anomalies in streaming data using online anomaly detectors with configurable thresholding and filtering.

Description

This workflow covers the procedure for identifying anomalous observations in a continuous data stream. It leverages River's anomaly detection module, which provides unsupervised models like Half-Space Trees (an online variant of Isolation Forests) and One-Class SVM. The process includes feature normalization, anomaly scoring, threshold calibration using quantile filters, and evaluation against labeled ground truth when available. Anomaly detectors produce scores (higher means more anomalous) without requiring labeled training data.

Usage

Execute this workflow when you need to identify unusual patterns or outliers in a continuous data stream, such as fraud detection in financial transactions, network intrusion detection, or sensor fault detection. This workflow is appropriate when labeled anomaly data is scarce or unavailable, and the model must adapt to evolving normal behavior.

Execution Steps

Step 1: Load or Connect to an Unlabeled Data Stream

Obtain a stream of observations for anomaly detection. River provides built-in anomaly datasets (HTTP, SMTP, CreditCard) where anomalies are labeled for evaluation purposes. For production use, the stream consists of feature dictionaries without labels. The anomaly detector learns what is normal from the stream itself.

Key considerations:

Anomaly datasets are highly imbalanced (very few anomalies)
Each observation is a Python dict of numeric features
Labels, if available, are only used for evaluation, not training
Built-in datasets: datasets.CreditCard(), datasets.HTTP(), datasets.SMTP()

Step 2: Normalize Features

Apply feature scaling to ensure all input features are on comparable scales. This is critical for distance-based and tree-based anomaly detectors. MinMaxScaler normalizes features to the [0,1] range, which is the expected input range for HalfSpaceTrees. StandardScaler is an alternative for SVM-based detectors.

Key considerations:

HalfSpaceTrees expects features in the [0, 1] range; use MinMaxScaler
Scalers learn the feature distributions incrementally
Chain the scaler and detector using the pipe operator
Feature normalization happens transparently within the pipeline

Step 3: Configure the Anomaly Detector

Select and configure an anomaly detection model. HalfSpaceTrees is the primary choice, building an ensemble of random space-partitioning trees that assign anomaly scores based on isolation depth. OneClassSVM provides an alternative using a stochastic online SVM. Configure key parameters like number of trees, tree height, and window size.

Key considerations:

HalfSpaceTrees parameters: n_trees (ensemble size), height (tree depth), window_size (reference vs working window)
Larger window_size provides more stable reference distributions
OneClassSVM uses a nu parameter controlling the expected anomaly fraction
LocalOutlierFactor provides density-based detection with a configurable neighborhood size

Step 4: Apply Anomaly Filters for Classification

Wrap the anomaly detector with a filter that converts continuous scores into binary anomaly decisions. QuantileFilter flags observations whose anomaly score exceeds a given quantile threshold. ThresholdFilter uses a fixed score threshold. ConstantFilter uses a constant threshold value.

Key considerations:

QuantileFilter adaptively sets the threshold based on score distribution
A quantile of 0.95 flags the top 5% most anomalous observations
ThresholdFilter provides a fixed cutoff for deterministic thresholding
Filters wrap the detector and expose a classify() method

Step 5: Evaluate Detection Performance

When ground truth labels are available, evaluate the detector using streaming metrics. ROCAUC measures ranking quality of raw anomaly scores. For filtered binary predictions, use standard classification metrics. The progressive validation pattern applies: score first, evaluate, then update the model.

Key considerations:

Use ROCAUC for evaluating raw anomaly scores (without filtering)
For binary classification after filtering, use Accuracy, F1, Precision, Recall
The evaluation loop uses score_one(x) for raw scores or classify for filtered labels
progressive_val_score works with anomaly detectors and filters

Step 6: Deploy for Continuous Monitoring

In production, the anomaly detector processes each new observation, returning an anomaly score. The model continuously adapts to evolving normal patterns. Set up alerting based on the filter output or raw score thresholds. The model requires no batch retraining and handles concept drift naturally through its windowed reference mechanism.

Key considerations:

HalfSpaceTrees uses a reference/working window swap for adaptation
The model adapts to gradual distribution shifts automatically
Memory usage is bounded by the window_size and number of trees
No retraining or batch processing is needed

Execution Diagram

GitHub URL

Workflow Repository