Principle:Online ml River Streaming Anomaly Datasets

From Leeroopedia


Knowledge Sources: River, River Docs
Domains: Online Machine Learning, Anomaly Detection, Benchmarking, Datasets
Last Updated: 2026-02-08 16:00 GMT

Overview

Built-in, labeled benchmark datasets for evaluating online anomaly detection algorithms on streaming data. Their anomaly proportions are known, which makes evaluation reproducible.

Description

Streaming Anomaly Datasets are pre-packaged, labeled datasets included in River's datasets module that serve as standard benchmarks for evaluating anomaly detection algorithms. Each dataset provides a stream of (features, label) pairs, where the label indicates whether each observation is normal (0) or anomalous (1).

These datasets are critical for:

  • Reproducible evaluation: Standardized benchmarks allow direct comparison of different anomaly detection approaches.
  • Realistic class imbalance: Real-world anomaly detection problems are highly imbalanced, and these datasets reflect that -- anomalies make up well under 1% of the observations in both datasets described here.
  • Streaming compatibility: Datasets are designed to be iterated one observation at a time, matching River's online learning paradigm.

River provides two primary anomaly detection benchmark datasets:

CreditCard: A fraud detection dataset containing 284,807 credit card transactions from European cardholders over two days (September 2013). Only 492 transactions (0.172%) are fraudulent. Features are PCA-transformed (V1-V28) plus Time and Amount, totaling 30 features.

HTTP: An intrusion detection dataset from KDD Cup 1999 containing 567,498 HTTP connections. Only 2,211 (0.39%) are anomalous. It has 3 numeric features (duration, src_bytes, dst_bytes).

Both datasets inherit from base.RemoteDataset, meaning they are downloaded on first use and cached locally.

Usage

Use streaming anomaly datasets when:

  • You need to benchmark an anomaly detection algorithm
  • You want to compare different detector configurations or algorithms
  • You need labeled data for computing ROCAUC or other classification metrics
  • You are prototyping an anomaly detection pipeline and need representative data
  • You want to reproduce results from River's documentation or papers

Theoretical Basis

Dataset characteristics:

Dataset    | Samples | Features | Anomaly %               | Domain              | Source
CreditCard | 284,807 | 30       | 0.172% (492 frauds)     | Fraud Detection     | ULB Machine Learning Group
HTTP       | 567,498 | 3        | 0.39% (2,211 anomalies) | Intrusion Detection | KDD Cup 1999
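The anomaly percentages follow directly from the sample and anomaly counts. A minimal check in plain Python, using only the figures quoted above (the helper name anomaly_rate is illustrative, not part of River):

```python
# Counts quoted in the table above: (total samples, labeled anomalies).
DATASETS = {
    "CreditCard": (284_807, 492),
    "HTTP": (567_498, 2_211),
}


def anomaly_rate(total: int, anomalies: int) -> float:
    """Return the share of anomalous observations as a percentage."""
    return 100.0 * anomalies / total


rates = {name: anomaly_rate(*counts) for name, counts in DATASETS.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.4f}% anomalous")
```

Both rates land below 0.4%, which is why threshold-free ranking metrics matter more than raw accuracy on these streams.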

Evaluation protocol for anomaly detection:

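Most anomaly detectors in River (for example HalfSpaceTrees) are unsupervised: they see only features during training, never labels. The labels in these datasets are therefore used exclusively for evaluation. Each observation is scored before the model learns from it, so every score is produced on data the model has not yet seen.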

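from river import anomaly, datasets, metrics, preprocessing

model = preprocessing.MinMaxScaler() | anomaly.HalfSpaceTrees(seed=42)  # example detector, chosen for illustration
metric = metrics.ROCAUC()
for x, y in datasets.CreditCard().take(2500):
    score = model.score_one(x)       # Predict (unsupervised)
    metric.update(y, score)          # Evaluate against ground truth
    model.learn_one(x)               # Learn (unsupervised, no y)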
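To make the score-then-learn order concrete without downloading a dataset, here is a minimal, self-contained sketch. RunningZScoreDetector and synthetic_stream are hypothetical stand-ins written for this example; they only mirror the score_one / learn_one interface used above and are not River classes.

```python
import math
import random


class RunningZScoreDetector:
    """Hypothetical stand-in detector: absolute z-score of a single feature."""

    def __init__(self, feature: str):
        self.feature = feature
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    def score_one(self, x: dict) -> float:
        if self.n < 2:
            return 0.0  # not enough history to estimate a spread
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x[self.feature] - self.mean) / std if std > 0 else 0.0

    def learn_one(self, x: dict) -> None:
        value = x[self.feature]
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)


def synthetic_stream(n: int, anomaly_rate: float, seed: int = 0):
    """Yield (features, label) pairs; anomalies are drawn far from the normal mean."""
    rng = random.Random(seed)
    for _ in range(n):
        is_anomaly = rng.random() < anomaly_rate
        center = 10.0 if is_anomaly else 0.0
        yield {"amount": rng.gauss(center, 1.0)}, int(is_anomaly)


model = RunningZScoreDetector("amount")
scores_by_label = {0: [], 1: []}

for x, y in synthetic_stream(n=5_000, anomaly_rate=0.01):
    score = model.score_one(x)        # 1. score before the model has seen x
    scores_by_label[y].append(score)  # 2. the label is used for evaluation only
    model.learn_one(x)                # 3. learn from features alone, no y

mean_normal = sum(scores_by_label[0]) / len(scores_by_label[0])
mean_anomalous = sum(scores_by_label[1]) / len(scores_by_label[1])
print(f"mean score, normal: {mean_normal:.2f}; anomalous: {mean_anomalous:.2f}")
```

The label y never reaches learn_one, so the detector stays unsupervised while the evaluation still has ground truth to compare against.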

Class imbalance considerations:

  • Standard accuracy is misleading: a model that always predicts "normal" achieves about 99.8% accuracy on CreditCard and 99.6% on HTTP while detecting no anomalies at all
  • ROCAUC is the recommended metric: it evaluates the quality of the score ranking regardless of threshold
  • ClassificationReport (precision, recall, F1) is useful when scores are converted to binary decisions with a specific threshold, for example through one of River's anomaly filters such as ThresholdFilter
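The first two points can be illustrated on toy labels without any library. The roc_auc helper below is an illustrative exact rank-based estimate (the probability that a random anomaly is scored above a random normal point); it is not River's ROCAUC implementation, and the label counts are made up for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random anomaly outranks a random normal point (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels: 1,000 observations, 5 of them anomalous (0.5%).
y_true = [0] * 995 + [1] * 5

# A useless "detector": predicts normal everywhere, same score for every point.
acc_always_normal = accuracy(y_true, [0] * 1000)
auc_constant = roc_auc(y_true, [0.0] * 1000)

# Scores that rank every anomaly above every normal point.
auc_perfect = roc_auc(y_true, [0.1] * 995 + [0.9] * 5)

print(acc_always_normal, auc_constant, auc_perfect)
```

The useless detector scores 0.995 on accuracy but only 0.5 on ROC AUC, the value expected from an uninformative ranking, while a perfect ranking reaches 1.0.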

Streaming iteration:

Both datasets support the .take(n) method to limit the number of observations, useful for quick experiments:

from river.datasets import CreditCard
dataset = CreditCard().take(2500)  # Iterator over the first 2,500 observations only
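Conceptually, limiting a stream this way is the same as slicing an iterator. The sketch below assumes .take(n) behaves like itertools.islice over the dataset's iterator (an assumption about River's internals, not something stated above) and uses a hypothetical toy_stream in place of a real dataset.

```python
import itertools


def toy_stream():
    """Hypothetical endless stream of (features, label) pairs."""
    for i in itertools.count():
        yield {"value": float(i)}, 0


# Consume only the first five observations; the rest are never generated.
first_five = list(itertools.islice(toy_stream(), 5))

for x, y in first_five:
    print(x, y)
```

A slice like this is a one-shot iterator: once it has been consumed, a second pass needs a fresh one.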
