Implementation:Online ml River Datasets Index

Knowledge Sources	Online_ml_River
Domains	Online_Learning, Datasets
Last Updated	2026-02-08 16:00 GMT

Overview

The datasets module in the River online machine learning library provides a collection of datasets for multiple tasks, including classification, regression, and multi-output learning. The data come from popular benchmark datasets and are wrapped so that they can be iterated over as a stream. All datasets in this module have a fixed size; infinite generators live in the synth submodule.

Description

This module contains dataset classes that provide streaming access to benchmark datasets commonly used in machine learning research. The datasets cover a range of tasks and application areas, including:

Binary and multi-class classification
Regression and multi-output regression
Multi-label classification
Text classification
Time series forecasting
Anomaly detection
Recommender systems

All datasets implement the same interface: each one is an iterable of (features, target) pairs, yielded one at a time, where features is a dictionary. This makes them well suited to online learning algorithms.

Usage

Use this module when you need:

Standard benchmark datasets for evaluating online learning algorithms
Real-world datasets with various characteristics (imbalanced, high-dimensional, sparse, etc.)
Datasets with concept drift for testing adaptive algorithms
Datasets for specific domains (text, images, time series, etc.)

Code Reference

Source Location

Repository: Online_ml_River
File: river/datasets/__init__.py

Module Exports

__all__ = [
    "AirlinePassengers",
    "Bananas",
    "base",
    "Bikes",
    "ChickWeights",
    "CreditCard",
    "Elec2",
    "Higgs",
    "HTTP",
    "ImageSegments",
    "Insects",
    "Keystroke",
    "MaliciousURL",
    "MovieLens100K",
    "Music",
    "Phishing",
    "Restaurants",
    "SMSSpam",
    "SMTP",
    "SolarFlare",
    "synth",
    "Taxis",
    "TREC07",
    "TrumpApproval",
    "WaterFlow",
    "WebTraffic",
]

Import

from river import datasets

# Instantiate a dataset through the module namespace
dataset = datasets.Bikes()

# Or import the class directly
from river.datasets import Bikes
dataset = Bikes()

Available Datasets

Binary Classification

Dataset	Samples	Features	Description
Bananas	5,300	2	Artificial banana-shaped clusters
CreditCard	284,807	30	Credit card fraud detection
HTTP	567,498	3	HTTP anomaly detection (0.4% positive)
MaliciousURL	2,396,130	3,231,961	URL malware detection (sparse)
Phishing	1,250	9	Phishing website detection
SMSSpam	5,574	1	SMS spam detection (text)
SMTP	95,156	3	SMTP anomaly detection
TREC07	75,419	5	Email spam detection (text)
Higgs	11,000,000	28	Particle physics signal detection

Multi-Class Classification

Dataset	Samples	Features	Classes	Description
Elec2	45,312	8	2	Electricity demand (concept drift)
ImageSegments	2,310	18	7	Image segment classification
Insects	52,848+	33	6	Concept drift evaluation (variants)
Keystroke	20,400	31	51	User identification by keystroke

Regression

Dataset	Samples	Features	Description
AirlinePassengers	144	1	Monthly airline passengers
Bikes	182,470	8	Bike sharing demand
ChickWeights	578	3	Chick weight over time
MovieLens100K	100,000	10	Movie rating prediction
Restaurants	252,108	7	Restaurant visitor prediction
Taxis	1,458,644	8	NYC taxi trip duration
TrumpApproval	1,001	6	Approval rating prediction

Multi-Output Classification

Dataset	Samples	Features	Outputs	Description
Music	593	72	6	Multi-label mood prediction

Multi-Output Regression

Dataset	Samples	Features	Outputs	Description
SolarFlare	1,066	10	3	Solar flare prediction
WebTraffic	44,160	3	2	Web session prediction

Common Usage Patterns

Basic Iteration

from river import datasets

dataset = datasets.Bikes()
for x, y in dataset:
    # x is a dictionary of features
    # y is the target value
    print(x, y)
    break

With Online Learning

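from river import compose, datasets, linear_model, metrics, preprocessing

dataset = datasets.Bikes()
# Bikes also yields a datetime and string fields, which a linear model
# cannot consume directly, so keep only the numeric features and scale them.
model = (
    compose.Select("clouds", "humidity", "pressure", "temperature", "wind") |
    preprocessing.StandardScaler() |
    linear_model.LinearRegression()
)
metric = metrics.MAE()

for x, y in dataset:
    y_pred = model.predict_one(x)  # predict before the model sees the label
    metric.update(y, y_pred)
    model.learn_one(x, y)

print(metric)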

With Preprocessing

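from river import datasets, feature_extraction, naive_bayes

dataset = datasets.SMSSpam()
# SMSSpam yields x = {"body": <message text>}, so BagOfWords (which lives in
# feature_extraction, not preprocessing) is pointed at the "body" field.
model = (
    feature_extraction.BagOfWords(on="body") |
    naive_bayes.BernoulliNB()
)

for x, y in dataset:
    y_pred = model.predict_one(x)
    model.learn_one(x, y)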

Dataset Properties

Dataset classes expose the following attributes; those that do not apply to a given task (for example n_classes on a regression dataset) are None:

n_samples: Total number of samples in the dataset
n_features: Number of features
task: Type of task (BINARY_CLF, MULTI_CLF, REG, MO_REG, MO_BINARY_CLF)
n_classes: Number of classes (for classification tasks)
n_outputs: Number of outputs (for multi-output tasks)
sparse: Whether features are sparse (for high-dimensional datasets)

Base Classes

The module uses two main base classes:

FileDataset: For datasets bundled with the library
RemoteDataset: For datasets downloaded from remote URLs

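Both expose the same iteration interface. RemoteDataset additionally downloads the data on first use and caches it locally; FileDataset reads files shipped with the package, so no download is involved.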

Synthetic Data

For infinite synthetic data generators, see the synth submodule:

from river.datasets import synth

# The generator is infinite, so take(1000) bounds the stream to 1000 samples
generator = synth.Agrawal()
for x, y in generator.take(1000):
    print(x, y)
