Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Online ml River Streaming Anomaly Detection

From Leeroopedia
Revision as of 11:03, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Online_ml_River_Streaming_Anomaly_Detection.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Online_ML, Anomaly_Detection, Streaming_Data
Last Updated 2026-02-08 16:00 GMT

Overview

End-to-end process for detecting anomalies in streaming data using online anomaly detectors with configurable thresholding and filtering.

Description

This workflow covers the procedure for identifying anomalous observations in a continuous data stream. It leverages River's anomaly detection module, which provides unsupervised models like Half-Space Trees (an online variant of Isolation Forests) and One-Class SVM. The process includes feature normalization, anomaly scoring, threshold calibration using quantile filters, and evaluation against labeled ground truth when available. Anomaly detectors produce scores (higher means more anomalous) without requiring labeled training data.

Usage

Execute this workflow when you need to identify unusual patterns or outliers in a continuous data stream, such as fraud detection in financial transactions, network intrusion detection, or sensor fault detection. This workflow is appropriate when labeled anomaly data is scarce or unavailable, and the model must adapt to evolving normal behavior.

Execution Steps

Step 1: Load or Connect to an Unlabeled Data Stream

Obtain a stream of observations for anomaly detection. River provides built-in anomaly datasets (HTTP, SMTP, CreditCard) where anomalies are labeled for evaluation purposes. For production use, the stream consists of feature dictionaries without labels. The anomaly detector learns what is normal from the stream itself.

Key considerations:

  • Anomaly datasets are highly imbalanced (very few anomalies)
  • Each observation is a Python dict of numeric features
  • Labels, if available, are only used for evaluation, not training
  • Built-in datasets: datasets.CreditCard(), datasets.HTTP(), datasets.SMTP()

Step 2: Normalize Features

Apply feature scaling to ensure all input features are on comparable scales. This is critical for distance-based and tree-based anomaly detectors. MinMaxScaler normalizes features to the [0,1] range, which is the expected input range for HalfSpaceTrees. StandardScaler is an alternative for SVM-based detectors.

Key considerations:

  • HalfSpaceTrees expects features in the [0, 1] range; use MinMaxScaler
  • Scalers learn the feature distributions incrementally
  • Chain the scaler and detector using the pipe operator
  • Feature normalization happens transparently within the pipeline

Step 3: Configure the Anomaly Detector

Select and configure an anomaly detection model. HalfSpaceTrees is the primary choice, building an ensemble of random space-partitioning trees that assign anomaly scores based on isolation depth. OneClassSVM provides an alternative using a stochastic online SVM. Configure key parameters like number of trees, tree height, and window size.

Key considerations:

  • HalfSpaceTrees parameters: n_trees (ensemble size), height (tree depth), window_size (reference vs working window)
  • Larger window_size provides more stable reference distributions
  • OneClassSVM uses a nu parameter controlling the expected anomaly fraction
  • LocalOutlierFactor provides density-based detection with a configurable neighborhood size

Step 4: Apply Anomaly Filters for Classification

Wrap the anomaly detector with a filter that converts continuous scores into binary anomaly decisions. QuantileFilter flags observations whose anomaly score exceeds a given quantile threshold. ThresholdFilter uses a fixed score threshold. ConstantFilter uses a constant threshold value.

Key considerations:

  • QuantileFilter adaptively sets the threshold based on score distribution
  • A quantile of 0.95 flags the top 5% most anomalous observations
  • ThresholdFilter provides a fixed cutoff for deterministic thresholding
  • Filters wrap the detector and expose a classify() method

Step 5: Evaluate Detection Performance

When ground truth labels are available, evaluate the detector using streaming metrics. ROCAUC measures ranking quality of raw anomaly scores. For filtered binary predictions, use standard classification metrics. The progressive validation pattern applies: score first, evaluate, then update the model.

Key considerations:

  • Use ROCAUC for evaluating raw anomaly scores (without filtering)
  • For binary classification after filtering, use Accuracy, F1, Precision, Recall
  • The evaluation loop uses score_one(x) for raw scores or classify for filtered labels
  • progressive_val_score works with anomaly detectors and filters

Step 6: Deploy for Continuous Monitoring

In production, the anomaly detector processes each new observation, returning an anomaly score. The model continuously adapts to evolving normal patterns. Set up alerting based on the filter output or raw score thresholds. The model requires no batch retraining and handles concept drift naturally through its windowed reference mechanism.

Key considerations:

  • HalfSpaceTrees uses a reference/working window swap for adaptation
  • The model adapts to gradual distribution shifts automatically
  • Memory usage is bounded by the window_size and number of trees
  • No retraining or batch processing is needed

Execution Diagram

GitHub URL

Workflow Repository