Implementation:Online ml River Datasets Index

Knowledge Sources
Domains Online_Learning, Datasets
Last Updated 2026-02-08 16:00 GMT

Overview

The datasets module of River, an online machine learning library, provides a collection of datasets for classification, regression, and multi-output learning. Each class wraps a popular benchmark dataset so that it can be iterated over as a stream. Every dataset listed here has a fixed size; unbounded generators live in the synth submodule.

Description

This module contains dataset classes that provide streaming access to various benchmark datasets commonly used in machine learning research. The datasets cover diverse domains including:

  • Binary and multi-class classification
  • Regression and multi-output regression
  • Multi-label classification
  • Text classification
  • Time series forecasting
  • Anomaly detection
  • Recommender systems

All datasets implement the same interface: iterating over a dataset yields (features, target) pairs one at a time, where features is a dictionary keyed by feature name. This makes them a natural fit for online learning algorithms, which process one observation at a time.
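The (features, target) contract can be illustrated without installing or downloading anything. The sketch below is a hypothetical stand-in, not River code: `ROWS` and `toy_stream` are invented names, and the feature values are made up. It only shows the shape of the items a dataset yields on iteration.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical hand-written rows standing in for a real dataset.
ROWS = [
    ({"temperature": 21.5, "humidity": 60.0}, 12),
    ({"temperature": 18.0, "humidity": 75.0}, 7),
    ({"temperature": 25.1, "humidity": 40.0}, 19),
]


def toy_stream() -> Iterator[Tuple[Dict[str, float], int]]:
    """Yield one (features, target) pair at a time, like a data stream."""
    for x, y in ROWS:
        yield x, y


seen = 0
for x, y in toy_stream():
    # x is a dict keyed by feature name; y is the target for that row.
    assert isinstance(x, dict)
    seen += 1

print(seen)
```

Because only one pair is held in memory at a time, the same loop works regardless of how many rows the underlying source contains.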

Usage

Use this module when you need:

  • Standard benchmark datasets for evaluating online learning algorithms
  • Real-world datasets with various characteristics (imbalanced, high-dimensional, sparse, etc.)
  • Datasets with concept drift for testing adaptive algorithms
  • Datasets for specific domains (text, images, time series, etc.)

Code Reference

Source Location

Module Exports

__all__ = [
    "AirlinePassengers",
    "Bananas",
    "base",
    "Bikes",
    "ChickWeights",
    "CreditCard",
    "Elec2",
    "Higgs",
    "HTTP",
    "ImageSegments",
    "Insects",
    "Keystroke",
    "MaliciousURL",
    "MovieLens100K",
    "Music",
    "Phishing",
    "Restaurants",
    "SMSSpam",
    "SMTP",
    "SolarFlare",
    "synth",
    "Taxis",
    "TREC07",
    "TrumpApproval",
    "WaterFlow",
    "WebTraffic",
]

Import

from river import datasets

# Instantiate a dataset class through the module
dataset = datasets.Bikes()

# Or import the class directly
from river.datasets import Bikes
dataset = Bikes()

Available Datasets

Binary Classification

Dataset | Samples | Features | Description
Bananas | 5,300 | 2 | Artificial banana-shaped clusters
CreditCard | 284,807 | 29 | Credit card fraud detection
HTTP | 567,498 | 3 | HTTP anomaly detection (0.4% positive)
MaliciousURL | 2,396,130 | 3,231,961 | URL malware detection (sparse)
Phishing | 1,250 | 9 | Phishing website detection
SMSSpam | 5,574 | 1 | SMS spam detection (text)
SMTP | 95,156 | 3 | SMTP anomaly detection
TREC07 | 75,419 | 5 | Email spam detection (text)
Higgs | 11,000,000 | 28 | Particle physics signal detection

Multi-Class Classification

Dataset | Samples | Features | Classes | Description
Elec2 | 45,312 | 8 | 2 | Electricity demand (concept drift)
ImageSegments | 2,310 | 18 | 7 | Image segment classification
Insects | 52,848+ | 33 | 6 | Concept drift evaluation (variants)
Keystroke | 20,400 | 31 | 51 | User identification by keystroke

Regression

Dataset | Samples | Features | Description
AirlinePassengers | 144 | 1 | Monthly airline passengers
Bikes | 182,470 | 8 | Bike sharing demand
ChickWeights | 578 | 3 | Chick weight over time
MovieLens100K | 100,000 | 10 | Movie rating prediction
Restaurants | 252,108 | 7 | Restaurant visitor prediction
Taxis | 1,458,644 | 8 | NYC taxi trip duration
TrumpApproval | 1,001 | 6 | Approval rating prediction

Multi-Output Classification

Dataset | Samples | Features | Outputs | Description
Music | 593 | 72 | 6 | Multi-label mood prediction

Multi-Output Regression

Dataset | Samples | Features | Outputs | Description
SolarFlare | 1,066 | 10 | 3 | Solar flare prediction
WebTraffic | 44,160 | 3 | 2 | Web session prediction

Common Usage Patterns

Basic Iteration

from river import datasets

dataset = datasets.Bikes()
for x, y in dataset:
    # x is a dictionary of features
    # y is the target value
    print(x, y)
    break

With Online Learning

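from river import compose, datasets, linear_model, metrics, preprocessing

dataset = datasets.Bikes()
# Bikes also contains non-numeric features (moment, station, description),
# so keep only the numeric ones and scale them before the linear model.
model = (
    compose.Select("clouds", "humidity", "pressure", "temperature", "wind") |
    preprocessing.StandardScaler() |
    linear_model.LinearRegression()
)
metric = metrics.MAE()

for x, y in dataset:
    y_pred = model.predict_one(x)  # predict before seeing the target
    metric.update(y, y_pred)       # score that prediction
    model.learn_one(x, y)          # then learn from the true target

print(metric)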
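The loop above follows a predict, score, then learn order, so each prediction is evaluated on an observation the model has not yet learned from. A minimal dependency-free sketch of that ordering follows. `RunningMean` and the four-row `stream` are hypothetical stand-ins invented for illustration; they are not River classes or data.

```python
# Hypothetical toy stream of (features, target) pairs; not a River dataset.
stream = [
    ({"x": 1.0}, 2.0),
    ({"x": 2.0}, 4.0),
    ({"x": 3.0}, 6.0),
    ({"x": 4.0}, 8.0),
]


class RunningMean:
    """Toy regressor that predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


model = RunningMean()
abs_error_sum = 0.0
n = 0

for x, y in stream:
    y_pred = model.predict_one(x)      # 1. predict with the current model
    abs_error_sum += abs(y - y_pred)   # 2. score the prediction
    n += 1
    model.learn_one(x, y)              # 3. only then learn from the target

mae = abs_error_sum / n
print(mae)  # 2.75
```

Swapping steps 1 and 3 would let the model see each target before predicting it, which would make the reported error optimistic.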

With Preprocessing

from river import datasets, feature_extraction, naive_bayes

dataset = datasets.SMSSpam()
# BagOfWords lives in feature_extraction; on="body" selects the text field
model = (
    feature_extraction.BagOfWords(on="body") |
    naive_bayes.BernoulliNB()
)

for x, y in dataset:
    model.predict_one(x)
    model.learn_one(x, y)
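A bag-of-words step turns raw text into a dictionary of token counts, which is the kind of input a naive Bayes classifier can consume. The sketch below shows the idea in plain Python. The `bag_of_words` helper and its regular-expression tokenizer are simplifying assumptions made for illustration and are not River's exact tokenization rules.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> dict:
    """Map a text to {token: count} using a simplified tokenizer."""
    # Lowercase, then keep runs of word characters as tokens (an assumption).
    tokens = re.findall(r"\w+", text.lower())
    return dict(Counter(tokens))


features = bag_of_words("Free entry! Text WIN to win a free prize")
print(features)
```

Repeated words such as "free" and "win" end up with a count of 2, while word order and punctuation are discarded.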

Dataset Properties

All dataset classes provide the following properties:

  • n_samples: Total number of samples in the dataset
  • n_features: Number of features
  • task: Type of task (BINARY_CLF, MULTI_CLF, REG, MO_REG, MO_BINARY_CLF)
  • n_classes: Number of classes (for classification tasks)
  • n_outputs: Number of outputs (for multi-output tasks)
  • sparse: Whether features are sparse (for high-dimensional datasets)

Base Classes

The module uses two main base classes:

  • FileDataset: For datasets bundled with the library
  • RemoteDataset: For datasets downloaded from remote URLs

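Both expose the same iteration interface. RemoteDataset additionally downloads the data on first use and caches it locally, while FileDataset reads a file shipped with the package.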

Synthetic Data

For infinite synthetic data generators, see the synth submodule:

from river.datasets import synth

# Generate synthetic data
generator = synth.Agrawal()
for x, y in generator.take(1000):
    print(x, y)
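Because a synthetic generator never ends, a consumer has to bound the iteration itself, which is the role `take` plays above. The sketch below reproduces that idea in plain Python. `toy_generator` is a hypothetical endless stream with an invented labeling rule, and `itertools.islice` stands in for `take`.

```python
import itertools
import random


def toy_generator(seed: int = 42):
    """Hypothetical endless stream of (features, target) pairs."""
    rng = random.Random(seed)  # a fixed seed keeps the stream reproducible
    while True:
        x = {"a": rng.random(), "b": rng.random()}
        y = x["a"] + x["b"] > 1.0  # invented rule used as the label
        yield x, y


# Bound the endless stream to its first 1000 items.
sample = list(itertools.islice(toy_generator(), 1000))
print(len(sample))  # 1000
```

Looping over `toy_generator()` without a bound would never terminate, so the limit belongs at the call site.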
