Implementation:Online ml River Datasets Index
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Datasets |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
The datasets module of the River online machine learning library provides datasets for several tasks, including classification, regression, and multi-output learning. Each class wraps a popular dataset so that it can be iterated over as a stream. All of these datasets have a fixed size; infinite synthetic generators live in the synth submodule.
Description
This module contains dataset classes that provide streaming access to various benchmark datasets commonly used in machine learning research. The datasets cover diverse domains including:
- Binary and multi-class classification
- Regression and multi-output regression
- Multi-label classification
- Text classification
- Time series forecasting
- Anomaly detection
- Recommender systems
All datasets implement a consistent interface allowing users to iterate over (features, target) pairs in a streaming fashion, making them ideal for online learning algorithms.
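To make the interface concrete, here is a minimal sketch of a dataset that can be iterated as (features, target) pairs. It uses only the standard library and is not River's implementation; the class name `TinyStream`, the column names, and the inline CSV data are hypothetical.

```python
import csv
import io

# Hypothetical inline data standing in for a file on disk.
RAW = """temperature,humidity,rentals
6.5,81,12
7.1,78,15
9.3,70,21
"""


class TinyStream:
    """Illustrative stream dataset: yields one (features, target) pair per row."""

    def __init__(self, text, target):
        self.text = text
        self.target = target

    def __iter__(self):
        for row in csv.DictReader(io.StringIO(self.text)):
            y = float(row.pop(self.target))  # the target column leaves the row
            x = {name: float(value) for name, value in row.items()}
            yield x, y


stream = TinyStream(RAW, target="rentals")
first_x, first_y = next(iter(stream))
print(first_x, first_y)
```

Because each sample is produced lazily inside `__iter__`, the full dataset never needs to sit in memory, which is the property online learning algorithms rely on.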
Usage
Use this module when you need:
- Standard benchmark datasets for evaluating online learning algorithms
- Real-world datasets with various characteristics (imbalanced, high-dimensional, sparse, etc.)
- Datasets with concept drift for testing adaptive algorithms
- Datasets for specific domains (text, images, time series, etc.)
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/datasets/__init__.py
Module Exports
__all__ = [
"AirlinePassengers",
"Bananas",
"base",
"Bikes",
"ChickWeights",
"CreditCard",
"Elec2",
"Higgs",
"HTTP",
"ImageSegments",
"Insects",
"Keystroke",
"MaliciousURL",
"MovieLens100K",
"Music",
"Phishing",
"Restaurants",
"SMSSpam",
"SMTP",
"SolarFlare",
"synth",
"Taxis",
"TREC07",
"TrumpApproval",
"WaterFlow",
"WebTraffic",
]
Import
from river import datasets
# Instantiate a dataset through the module namespace
dataset = datasets.Bikes()
# Or import the class directly
from river.datasets import Bikes
dataset = Bikes()
Available Datasets
Binary Classification
| Dataset | Samples | Features | Description |
|---|---|---|---|
| Bananas | 5,300 | 2 | Artificial banana-shaped clusters |
| CreditCard | 284,807 | 30 | Credit card fraud detection |
| HTTP | 567,498 | 3 | HTTP anomaly detection (0.4% positive) |
| MaliciousURL | 2,396,130 | 3,231,961 | URL malware detection (sparse) |
| Phishing | 1,250 | 9 | Phishing website detection |
| SMSSpam | 5,574 | 1 | SMS spam detection (text) |
| SMTP | 95,156 | 3 | SMTP anomaly detection |
| TREC07 | 75,419 | 5 | Email spam detection (text) |
| Higgs | 11,000,000 | 28 | Particle physics signal detection |
| Elec2 | 45,312 | 8 | Electricity demand (concept drift) |
Multi-Class Classification
| Dataset | Samples | Features | Classes | Description |
|---|---|---|---|---|
| ImageSegments | 2,310 | 18 | 7 | Image segment classification |
| Insects | 52,848+ | 33 | 6 | Concept drift evaluation (variants) |
| Keystroke | 20,400 | 31 | 51 | User identification by keystroke |
Regression
| Dataset | Samples | Features | Description |
|---|---|---|---|
| AirlinePassengers | 144 | 1 | Monthly airline passengers |
| Bikes | 182,470 | 8 | Bike sharing demand |
| ChickWeights | 578 | 3 | Chick weight over time |
| MovieLens100K | 100,000 | 10 | Movie rating prediction |
| Restaurants | 252,108 | 7 | Restaurant visitor prediction |
| Taxis | 1,458,644 | 8 | NYC taxi trip duration |
| TrumpApproval | 1,001 | 6 | Approval rating prediction |
Multi-Output Classification
| Dataset | Samples | Features | Outputs | Description |
|---|---|---|---|---|
| Music | 593 | 72 | 6 | Multi-label mood prediction |
Multi-Output Regression
| Dataset | Samples | Features | Outputs | Description |
|---|---|---|---|---|
| SolarFlare | 1,066 | 10 | 3 | Solar flare prediction |
| WebTraffic | 44,160 | 3 | 2 | Web session prediction |
Common Usage Patterns
Basic Iteration
from river import datasets
dataset = datasets.Bikes()
for x, y in dataset:
    # x is a dictionary of features
    # y is the target value
    print(x, y)
    break
With Online Learning
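from river import compose, datasets, linear_model, metrics, preprocessing
dataset = datasets.Bikes()
# Bikes also holds a timestamp and string features, so keep only the numeric ones
model = (
    compose.Select("clouds", "humidity", "pressure", "temperature", "wind")
    | preprocessing.StandardScaler()
    | linear_model.LinearRegression()
)
metric = metrics.MAE()
for x, y in dataset:
    y_pred = model.predict_one(x)  # predict before the target is revealed
    metric.update(y, y_pred)
    model.learn_one(x, y)
print(metric)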
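The loop above follows a predict-then-learn order, so every prediction is scored on a sample the model has not yet seen. The sketch below isolates that evaluation scheme using only the standard library; the `RunningMean` model, the `progressive_mae` helper, and the three-sample stream are hypothetical stand-ins, not River classes.

```python
class RunningMean:
    """Toy online regressor: always predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n  # incremental mean update


def progressive_mae(stream, model):
    """Predict first, then learn, and return the mean absolute error."""
    total_error = 0.0
    count = 0
    for x, y in stream:
        y_pred = model.predict_one(x)  # the model has not seen y yet
        total_error += abs(y - y_pred)
        count += 1
        model.learn_one(x, y)  # only now is the target revealed
    return total_error / count if count else 0.0


# Hypothetical three-sample stream.
stream = [({"t": 0}, 10.0), ({"t": 1}, 20.0), ({"t": 2}, 30.0)]
mae = progressive_mae(stream, RunningMean())
print(mae)
```

Swapping the two calls inside the loop would let the model learn each target before predicting it, which would make the reported error optimistic.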
With Preprocessing
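from river import datasets, feature_extraction, naive_bayes
dataset = datasets.SMSSpam()
# BagOfWords lives in feature_extraction; on="body" selects the text field
model = (
    feature_extraction.BagOfWords(on="body")
    | naive_bayes.BernoulliNB()
)
for x, y in dataset:
    y_pred = model.predict_one(x)  # predict before learning
    model.learn_one(x, y)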
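The bag-of-words step in the pipeline above turns raw text into token counts that a Naive Bayes model can consume. The following is a rough standard-library sketch of that idea; the tokenization rule (lowercasing plus a `\w+` regex), the `bag_of_words` helper, and the sample message are assumptions and may differ from what River does internally.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"\w+", text.lower())
    return dict(Counter(tokens))


# Hypothetical message, shaped like a sample with a single text feature.
x = {"body": "Free entry! Text FREE to win"}
counts = bag_of_words(x["body"])
print(counts)
```

The result is a sparse dictionary, so only the tokens that actually occur in a message take up space.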
Dataset Properties
All dataset classes provide the following properties:
- n_samples: Total number of samples in the dataset
- n_features: Number of features
- task: Type of task (BINARY_CLF, MULTI_CLF, REG, MO_REG, MO_BINARY_CLF)
- n_classes: Number of classes (for classification tasks)
- n_outputs: Number of outputs (for multi-output tasks)
- sparse: Whether features are sparse (for high-dimensional datasets)
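One use of this metadata is to configure an experiment without reading any data first. The sketch below mirrors the listed properties in a hypothetical `DatasetInfo` container and picks a metric name from the task type; the container, the `default_metric` helper, and the plain-string task values are illustrative assumptions, while the sample counts come from the tables above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetInfo:
    """Hypothetical container mirroring the properties listed above."""

    n_samples: int
    n_features: int
    task: str
    n_classes: Optional[int] = None  # only meaningful for classification
    n_outputs: Optional[int] = None  # only meaningful for multi-output tasks
    sparse: bool = False


def default_metric(info):
    """Pick a metric name from the task type (an illustrative convention)."""
    if info.task in ("BINARY_CLF", "MULTI_CLF", "MO_BINARY_CLF"):
        return "accuracy"
    return "mae"


phishing = DatasetInfo(n_samples=1250, n_features=9, task="BINARY_CLF", n_classes=2)
solar_flare = DatasetInfo(n_samples=1066, n_features=10, task="MO_REG", n_outputs=3)
print(default_metric(phishing), default_metric(solar_flare))
```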
Base Classes
The module uses two main base classes:
- FileDataset: For datasets bundled with the library
- RemoteDataset: For datasets downloaded from remote URLs
Both expose the same iteration interface. Only RemoteDataset needs to download and cache its data; a FileDataset reads a file that ships with the package.
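The download-and-cache behaviour of remote datasets can be pictured as "fetch once, then reuse the local copy". This is a minimal sketch of that pattern, not River's actual code; `cached_path`, `fake_fetch`, and the file name `toy.csv` are hypothetical, and a real download would replace the stand-in fetch function.

```python
import tempfile
from pathlib import Path


def cached_path(name, fetch, cache_dir):
    """Return the local path for `name`, calling `fetch` only on a cache miss."""
    path = Path(cache_dir) / name
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch())
    return path


calls = []


def fake_fetch():
    """Stand-in for an HTTP download; records how often it is called."""
    calls.append(1)
    return b"x,y\n1,2\n"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_path("toy.csv", fake_fetch, tmp)
    second = cached_path("toy.csv", fake_fetch, tmp)  # served from the cache
    same_file = first == second
    content = second.read_bytes()

print(len(calls), same_file)
```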
Synthetic Data
For infinite synthetic data generators, see the synth submodule:
from river.datasets import synth
# Generate synthetic data
generator = synth.Agrawal()
for x, y in generator.take(1000):
    print(x, y)
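Because a synthetic generator never ends, the stream has to be truncated before it is consumed. The sketch below shows the idea with a hypothetical infinite generator and a `take` helper built on `itertools.islice`; both names are illustrative and make no claim about how River implements its own `take`.

```python
import itertools
import random


def noisy_line(seed=42):
    """Hypothetical infinite generator: y = 2 * x plus Gaussian noise."""
    rng = random.Random(seed)
    while True:
        x = rng.uniform(0, 1)
        yield {"x": x}, 2 * x + rng.gauss(0, 0.1)


def take(stream, k):
    """Yield only the first k samples of a (possibly infinite) stream."""
    return itertools.islice(stream, k)


samples = list(take(noisy_line(), 1000))
print(len(samples))
```

Seeding the generator keeps runs reproducible, which matters when comparing models on the same synthetic stream.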