Implementation:Online ml River Datasets CreditCard
| Knowledge Sources | River, River Docs |
|---|---|
| Domains | Online Machine Learning, Anomaly Detection, Benchmarking, Datasets |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Concrete tool for accessing the CreditCard fraud detection and HTTP intrusion detection benchmark datasets in the River library, providing labeled streaming data for evaluating online anomaly detectors.
Description
This implementation covers two dataset classes used for anomaly detection benchmarking:
datasets.CreditCard -- Contains 284,807 credit card transactions from European cardholders (September 2013), with 492 fraudulent transactions (0.172%). Features are PCA-transformed numerical values (V1-V28) plus Time and Amount, totaling 30 features. The target variable is "Class" (0=normal, 1=fraud).
datasets.HTTP -- Contains 567,498 HTTP connections from the KDD 1999 cup, with 2,211 anomalous connections (0.39%). Features are duration, src_bytes, and dst_bytes (3 features). The target variable is "service" (0=normal, 1=anomaly).
Both classes inherit from base.RemoteDataset. Each dataset is downloaded from its remote URL on first use and cached locally. Iterating over a dataset yields (x, y) tuples, where x is a feature dictionary and y is an integer label.
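The class imbalance quoted above can be checked with a little arithmetic. The sketch below uses only the sample and positive counts stated in the descriptions; the helper names are illustrative and not part of River. It also shows why plain accuracy is a poor metric here: a model that always predicts "normal" already scores above 99%.

```python
# Counts quoted in the dataset descriptions above.
DATASETS = {
    "CreditCard": {"n_samples": 284_807, "n_positive": 492},
    "HTTP": {"n_samples": 567_498, "n_positive": 2_211},
}


def positive_rate(n_positive: int, n_samples: int) -> float:
    """Fraction of observations labeled 1 (fraud or anomaly)."""
    return n_positive / n_samples


def majority_baseline_accuracy(n_positive: int, n_samples: int) -> float:
    """Accuracy of a trivial model that always predicts 0 (normal)."""
    return 1.0 - positive_rate(n_positive, n_samples)


for name, counts in DATASETS.items():
    rate = positive_rate(counts["n_positive"], counts["n_samples"])
    baseline = majority_baseline_accuracy(counts["n_positive"], counts["n_samples"])
    print(f"{name}: positive rate {rate:.3%}, always-normal accuracy {baseline:.3%}")
```

This is one reason the usage examples on this page report ROC AUC rather than accuracy.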
Usage
Import and use these datasets when:
- You need a standard benchmark for evaluating anomaly detection algorithms
- You want to reproduce results from River's documentation
- You need labeled streaming data with realistic class imbalance
Code Reference
Source Location
river/datasets/credit_card.py, lines 8-54 (CreditCard)
river/datasets/http.py, lines 8-37 (HTTP)
Signatures
class CreditCard(base.RemoteDataset):
    def __init__(self) -> None:
        super().__init__(
            n_samples=284_807,
            n_features=30,
            task=base.BINARY_CLF,
            url="https://maxhalford.github.io/files/datasets/creditcardfraud.zip",
            size=150_828_752,
            filename="creditcard.csv",
        )
class HTTP(base.RemoteDataset):
    def __init__(self) -> None:
        super().__init__(
            n_samples=567_498,
            n_features=3,
            task=base.BINARY_CLF,
            url="https://maxhalford.github.io/files/datasets/kdd99_http.zip",
            size=32_400_738,
            filename="kdd99_http.csv",
        )
Import
from river import datasets
cc = datasets.CreditCard()
http = datasets.HTTP()
Parameters
Neither class takes constructor parameters. Dataset metadata is set internally:
| Dataset | n_samples | n_features | Task | File Size |
|---|---|---|---|---|
| CreditCard | 284,807 | 30 | Binary Classification | ~150 MB |
| HTTP | 567,498 | 3 | Binary Classification | ~32 MB |
Methods
Both datasets inherit standard dataset methods:
- Iteration via for x, y in dataset -- yields (dict, int) tuples.
- .take(n) -- limits iteration to the first n observations.
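The two methods above can be illustrated without downloading anything. The sketch below is a stand-in, not River's code: fake_http_stream is a hypothetical generator that mimics the (dict, int) shape of the HTTP dataset with made-up values, and take is written on the assumption that .take(n) behaves like itertools.islice.

```python
import itertools
import random
from typing import Dict, Iterable, Iterator, Tuple

Observation = Tuple[Dict[str, float], int]


def fake_http_stream(seed: int = 0) -> Iterator[Observation]:
    """Hypothetical stand-in for a dataset: yields (x, y) tuples forever."""
    rng = random.Random(seed)
    while True:
        x = {
            "duration": rng.random(),
            "src_bytes": rng.random(),
            "dst_bytes": rng.random(),
        }
        y = 1 if rng.random() < 0.004 else 0  # roughly HTTP-like imbalance
        yield x, y


def take(stream: Iterable[Observation], n: int) -> Iterator[Observation]:
    """Yield at most the first n observations (assumed equivalent of .take(n))."""
    return itertools.islice(stream, n)


first_five = list(take(fake_http_stream(seed=42), 5))
for x, y in first_five:
    print(sorted(x), y)
```

Under that assumption the result of take is a one-shot iterator, so wrap it in list() if the subset needs to be traversed more than once.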
I/O Contract
Inputs
No inputs are required. The only external dependency is the one-time download of the data file on first use.
Outputs
| Output | Type | Description |
|---|---|---|
| x | dict | Feature dictionary. CreditCard: {'V1': float, ..., 'V28': float, 'Time': float, 'Amount': float}. HTTP: {'duration': float, 'src_bytes': float, 'dst_bytes': float}. |
| y | int | Target label. 0 = normal, 1 = anomaly. |
CreditCard Features
| Feature | Type | Description |
|---|---|---|
| V1 - V28 | float | PCA-transformed features (original features anonymized for confidentiality). |
| Time | float | Seconds elapsed since the first transaction in the dataset. |
| Amount | float | Transaction amount. |
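When feeding CreditCard observations into a pipeline, it can help to verify that each feature dictionary matches the table above. The following is a minimal sketch: both helper functions are hypothetical (they are not River APIs), and the sample observation uses placeholder values, not real data.

```python
from typing import Dict, List


def expected_creditcard_keys() -> List[str]:
    """Feature names as listed in the table above: Time, V1..V28, Amount."""
    return ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]


def check_creditcard_features(x: Dict[str, float]) -> None:
    """Raise ValueError if x is missing features or has unexpected ones."""
    expected = set(expected_creditcard_keys())
    missing = expected - set(x)
    extra = set(x) - expected
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)}, unexpected={sorted(extra)}")


# A made-up observation with the right shape (values are placeholders).
sample = {name: 0.0 for name in expected_creditcard_keys()}
check_creditcard_features(sample)  # passes silently
print(len(sample))
```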
HTTP Features
| Feature | Type | Description |
|---|---|---|
| duration | float | Duration of the HTTP connection. |
| src_bytes | float | Number of bytes sent from source. |
| dst_bytes | float | Number of bytes sent from destination. |
Usage Examples
Iterating over CreditCard dataset:
from river import datasets
for x, y in datasets.CreditCard().take(5):
    print(f"Features: {len(x)} keys, Label: {y}")
# Features: 30 keys, Label: 0
# Features: 30 keys, Label: 0
# ...
Evaluating HalfSpaceTrees on CreditCard:
from river import anomaly, compose, datasets, metrics, preprocessing
model = compose.Pipeline(
    preprocessing.MinMaxScaler(),
    anomaly.HalfSpaceTrees(seed=42)
)
auc = metrics.ROCAUC()
for x, y in datasets.CreditCard().take(2500):
    score = model.score_one(x)
    model.learn_one(x)
    auc.update(y, score)
print(auc)
# ROCAUC: 91.15%
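The loop above follows a test-then-train (prequential) order: each observation is scored before the model learns from it, and the label is used only for evaluation. The sketch below reproduces that pattern with the standard library alone, so it runs without River or a download. Everything in it is an illustrative assumption: RunningMeanDetector is a toy scorer (not a River class), roc_auc is a simple pairwise implementation, and the data is synthetic.

```python
import random
from typing import Dict, List, Tuple


class RunningMeanDetector:
    """Toy anomaly scorer (not a River class): distance from the running mean."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0

    def score_one(self, x: Dict[str, float]) -> float:
        # Before any data has been seen there is no reference point, so score 0.
        if self.n == 0:
            return 0.0
        return abs(x["value"] - self.mean)

    def learn_one(self, x: Dict[str, float]) -> None:
        # Incremental mean update.
        self.n += 1
        self.mean += (x["value"] - self.mean) / self.n


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def synthetic_stream(n: int, seed: int = 0) -> List[Tuple[Dict[str, float], int]]:
    """Made-up data: normal values near 0, rare anomalies near 10."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = 1 if rng.random() < 0.02 else 0
        center = 10.0 if y else 0.0
        data.append(({"value": rng.gauss(center, 1.0)}, y))
    return data


model = RunningMeanDetector()
labels: List[int] = []
scores: List[float] = []
for x, y in synthetic_stream(2_000, seed=42):
    scores.append(model.score_one(x))  # 1. score with the model as it was before x
    model.learn_one(x)                 # 2. only then let the model learn from x
    labels.append(y)                   # the label is used for evaluation only

auc = roc_auc(labels, scores)
print(f"ROC AUC: {auc:.3f}")
```

Swapping steps 1 and 2 would let the model see each observation before scoring it, which tends to make the reported metric optimistic.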
Using HTTP dataset:
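from river import anomaly, compose, datasets, metrics, preprocessing
# HalfSpaceTrees assumes features lie in [0, 1], so rescale them online first.
model = compose.Pipeline(
    preprocessing.MinMaxScaler(),
    anomaly.HalfSpaceTrees(seed=42)
)
auc = metrics.ROCAUC()
for x, y in datasets.HTTP().take(1000):
    score = model.score_one(x)  # score before learning (test-then-train)
    model.learn_one(x)
    auc.update(y, score)
print(auc)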
Using take() for quick experiments:
from river import datasets
# Full dataset
cc_full = datasets.CreditCard()
print(f"CreditCard: {cc_full.n_samples} samples, {cc_full.n_features} features")
# CreditCard: 284807 samples, 30 features
# Subset for prototyping
cc_small = datasets.CreditCard().take(2500)