Implementation:Online ml River Datasets CreditCard

Knowledge Sources	River River Docs
Domains	Online Machine Learning, Anomaly Detection, Benchmarking, Datasets
Last Updated	2026-02-08 16:00 GMT

Overview

Concrete tool for accessing the CreditCard fraud detection and HTTP intrusion detection benchmark datasets in the River library, providing labeled streaming data for evaluating online anomaly detectors.

Description

This implementation covers two dataset classes used for anomaly detection benchmarking:

datasets.CreditCard -- Contains 284,807 credit card transactions from European cardholders (September 2013), with 492 fraudulent transactions (0.172%). Features are PCA-transformed numerical values (V1-V28) plus Time and Amount, totaling 30 features. The target variable is "Class" (0=normal, 1=fraud).

datasets.HTTP -- Contains 567,498 HTTP connections from the KDD Cup 1999 data, with 2,211 anomalous connections (0.39%). Features are duration, src_bytes, and dst_bytes (3 features). The target variable is "service" (0=normal, 1=anomaly).

Both classes inherit from base.RemoteDataset. The datasets are downloaded from a remote URL on first use and cached locally. They yield (x, y) tuples where x is a feature dictionary and y is an integer label.

Usage

Import and use these datasets when:

You need a standard benchmark for evaluating anomaly detection algorithms
You want to reproduce results from River's documentation
You need labeled streaming data with realistic class imbalance
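
The class imbalance quoted in the Description can be checked from the counts alone. The sketch below uses only the figures stated on this page (no download is involved), and the variable names are illustrative. It also shows why plain accuracy is uninformative on these datasets: a trivial model that labels every observation as normal would already score above 99%.

```python
# Positive (anomalous) count and total count, as stated in the Description.
counts = {
    "CreditCard": (492, 284_807),
    "HTTP": (2_211, 567_498),
}

# Share of positive labels, in percent.
rates = {name: 100 * pos / total for name, (pos, total) in counts.items()}

# Accuracy of a trivial model that always predicts 0 (normal), in percent.
baseline_accuracy = {name: 100 - rate for name, rate in rates.items()}

for name in counts:
    print(f"{name}: {rates[name]:.3f}% positive, "
          f"always-normal accuracy {baseline_accuracy[name]:.2f}%")
```

A ranking metric such as ROC AUC, used in the examples on this page, is not inflated by the majority class in the same way.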

Code Reference

Source Location

river/datasets/credit_card.py, lines 8-54 (CreditCard)
river/datasets/http.py, lines 8-37 (HTTP)

Signatures

class CreditCard(base.RemoteDataset):
    def __init__(self) -> None:
        super().__init__(
            n_samples=284_807,
            n_features=30,
            task=base.BINARY_CLF,
            url="https://maxhalford.github.io/files/datasets/creditcardfraud.zip",
            size=150_828_752,
            filename="creditcard.csv",
        )

class HTTP(base.RemoteDataset):
    def __init__(self) -> None:
        super().__init__(
            n_samples=567_498,
            n_features=3,
            task=base.BINARY_CLF,
            url="https://maxhalford.github.io/files/datasets/kdd99_http.zip",
            size=32_400_738,
            filename="kdd99_http.csv",
        )

Import

from river import datasets

cc = datasets.CreditCard()
http = datasets.HTTP()

Parameters

Neither class takes constructor parameters. Dataset metadata is set internally:

Dataset	n_samples	n_features	Task	File Size
CreditCard	284,807	30	Binary Classification	~150 MB
HTTP	567,498	3	Binary Classification	~32 MB
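
The approximate file sizes in the table follow from the size arguments shown in the signatures, assuming those values are expressed in bytes (an assumption, since this page does not state the unit). A small conversion sketch:

```python
# size arguments copied from the signatures; the unit (bytes) is assumed.
sizes_in_bytes = {"CreditCard": 150_828_752, "HTTP": 32_400_738}

# Decimal megabytes (1 MB = 1,000,000 bytes), matching the table's "~" figures.
sizes_mb = {name: n / 1_000_000 for name, n in sizes_in_bytes.items()}

# Binary mebibytes (1 MiB = 1,048,576 bytes), as some tools report sizes.
sizes_mib = {name: n / 1024 ** 2 for name, n in sizes_in_bytes.items()}

for name in sizes_in_bytes:
    print(f"{name}: {sizes_mb[name]:.1f} MB ({sizes_mib[name]:.1f} MiB)")
```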

Methods

Both datasets inherit standard dataset methods:

Iteration via for x, y in dataset -- yields (dict, int) tuples.
.take(n) -- limits iteration to the first n observations.
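
As a rough mental model, take(n) behaves like itertools.islice applied to the dataset iterator. The sketch below uses a hypothetical stand-in generator, fake_stream, instead of a real River dataset so that it runs without a download. It illustrates the first-n semantics and the fact that an islice-style result is a one-shot iterator. Whether River implements take exactly this way is an assumption.

```python
import itertools


def fake_stream(n_total):
    """Hypothetical stand-in for a dataset: yields (x, y) pairs."""
    for i in range(n_total):
        x = {"duration": float(i), "src_bytes": 0.0, "dst_bytes": 0.0}
        y = 0
        yield x, y


def take(stream, n):
    """Assumed equivalent of dataset.take(n): the first n observations."""
    return itertools.islice(stream, n)


subset = take(fake_stream(10), 3)
first_pass = list(subset)
second_pass = list(subset)  # the iterator is already exhausted

print(len(first_pass))   # 3
print(len(second_pass))  # 0
```

If a subset has to be traversed more than once, call take(n) again on a fresh dataset object or materialize the result with list(...).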

I/O Contract

Inputs

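No constructor arguments or input data are required. Because the data file is downloaded on first use, network access is needed the first time a dataset is iterated.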

Outputs

Output	Type	Description
x	dict	Feature dictionary. CreditCard: {'V1': float, ..., 'V28': float, 'Time': float, 'Amount': float}. HTTP: {'duration': float, 'src_bytes': float, 'dst_bytes': float}.
y	int	Target label. 0 = normal, 1 = anomaly.

CreditCard Features

Feature	Type	Description
V1 - V28	float	PCA-transformed features (original features anonymized for confidentiality).
Time	float	Seconds elapsed since the first transaction in the dataset.
Amount	float	Transaction amount.

HTTP Features

Feature	Type	Description
duration	float	Duration of the HTTP connection.
src_bytes	float	Number of bytes sent from source.
dst_bytes	float	Number of bytes sent from destination.
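
The feature tables above can be turned into a small sanity check for records coming out of either dataset. The helper check_record and the EXPECTED_KEYS mapping are hypothetical names introduced for illustration, and the sample record is hand-built rather than taken from the data.

```python
# Expected feature names per dataset, derived from the tables on this page.
EXPECTED_KEYS = {
    "CreditCard": {"Time", "Amount"} | {f"V{i}" for i in range(1, 29)},
    "HTTP": {"duration", "src_bytes", "dst_bytes"},
}


def check_record(dataset_name, x, y):
    """Return True if (x, y) matches the I/O contract described on this page."""
    keys_ok = set(x) == EXPECTED_KEYS[dataset_name]
    values_ok = all(isinstance(v, float) for v in x.values())
    label_ok = y in (0, 1)
    return keys_ok and values_ok and label_ok


# Hand-built HTTP-shaped record (values are made up).
sample_x = {"duration": 0.0, "src_bytes": 5.3, "dst_bytes": 10.7}

print(len(EXPECTED_KEYS["CreditCard"]))         # 30
print(check_record("HTTP", sample_x, 0))        # True
print(check_record("CreditCard", sample_x, 0))  # False
```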

Usage Examples

Iterating over CreditCard dataset:

from river import datasets

for x, y in datasets.CreditCard().take(5):
    print(f"Features: {len(x)} keys, Label: {y}")
# Features: 30 keys, Label: 0
# Features: 30 keys, Label: 0
# ...

Evaluating HalfSpaceTrees on CreditCard:

from river import anomaly, compose, datasets, metrics, preprocessing

model = compose.Pipeline(
    preprocessing.MinMaxScaler(),
    anomaly.HalfSpaceTrees(seed=42)
)

auc = metrics.ROCAUC()

for x, y in datasets.CreditCard().take(2500):
    score = model.score_one(x)
    model.learn_one(x)
    auc.update(y, score)

print(auc)
# ROCAUC: 91.15%
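
The loop above follows a test-then-train order: each observation is scored before the model learns from it, so a score never benefits from the model having already seen that observation. The sketch below shows the same protocol with a hypothetical pure-Python detector, RunningZScore, which is a stand-in for HalfSpaceTrees and not a River class. The stream is made up for illustration.

```python
import math


class RunningZScore:
    """Hypothetical stand-in detector (not part of River): scores a value by
    its absolute z-score under a running mean and variance (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def score_one(self, value):
        if self.n < 2:
            return 0.0  # not enough history to estimate the spread
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(value - self.mean) / std if std > 0 else 0.0

    def learn_one(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)


# Made-up stream: values near 10 (label 0) followed by one outlier (label 1).
stream = [(10.0 + (i % 5) * 0.1, 0) for i in range(50)] + [(100.0, 1)]

model = RunningZScore()
scored = []
for value, label in stream:
    score = model.score_one(value)  # score first ...
    model.learn_one(value)          # ... then learn
    scored.append((score, label))

top_score, top_label = max(scored)
print(top_label)  # 1
```

Swapping the two calls inside the loop would let the detector absorb each value before scoring it, which tends to understate how anomalous that value is.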

Using HTTP dataset:

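from river import anomaly, compose, datasets, metrics, preprocessing

# HalfSpaceTrees assumes feature values in [0, 1], so rescale first
model = compose.Pipeline(
    preprocessing.MinMaxScaler(),
    anomaly.HalfSpaceTrees(seed=42)
)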
auc = metrics.ROCAUC()

for x, y in datasets.HTTP().take(1000):
    score = model.score_one(x)
    model.learn_one(x)
    auc.update(y, score)

print(auc)

Using take() for quick experiments:

from river import datasets

# Full dataset
cc_full = datasets.CreditCard()
print(f"CreditCard: {cc_full.n_samples} samples, {cc_full.n_features} features")
# CreditCard: 284807 samples, 30 features

# Subset for prototyping
cc_small = datasets.CreditCard().take(2500)
