Principle:Online ml River Built In Datasets
| Knowledge Sources | Machine Learning Experimental Design |
|---|---|
| Domains | Online_Learning Benchmarking Data_Management |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Built-in benchmark dataset collections provide curated, readily accessible datasets bundled with a machine learning library. They serve as standardized reference points for evaluating algorithms, reproducing experiments, testing implementations, and providing learning examples -- all without requiring users to locate, download, or preprocess external data sources.
Description
A well-designed dataset collection for online ML provides:
- Diverse task coverage: Datasets spanning classification, regression, anomaly detection, clustering, and other tasks to enable comprehensive algorithm evaluation.
- Streaming interface: Data is yielded one instance at a time as (features, target) pairs, matching the online learning paradigm where observations arrive sequentially.
- Metadata: Each dataset exposes properties such as the number of features, number of classes, total instances, and task type.
- Automatic download and caching: Larger datasets are downloaded on first use and cached locally, balancing library size with dataset availability.
- Synthetic and real-world: A mix of synthetic datasets (with known properties) and real-world datasets (reflecting practical challenges).
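The download-and-cache behaviour listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `ensure_cached`, its `fetch` callable, and the cache layout are hypothetical names chosen for the example, and the network call is replaced by an injected function so the sketch runs offline.

```python
from pathlib import Path
from typing import Callable


def ensure_cached(name: str, fetch: Callable[[], bytes], cache_dir: Path) -> Path:
    """Return the local path of a dataset file, fetching it only on first use.

    `fetch` is a hypothetical stand-in for whatever downloads the raw bytes.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        # Write to a temporary file first so an interrupted download never
        # leaves a half-written file that looks like a valid cache entry.
        tmp = target.with_name(target.name + ".part")
        tmp.write_bytes(fetch())
        tmp.replace(target)
    return target
```

Calling `ensure_cached` twice with the same name and directory should invoke `fetch` only once; the second call finds the file already on disk and returns its path directly.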
Dataset categories commonly found in online ML benchmarks:
- Binary classification: Spam detection, intrusion detection, credit scoring.
- Multi-class classification: Image segmentation, text categorization.
- Regression: Demand forecasting, sensor readings, approval ratings.
- Anomaly detection: Network intrusion (HTTP/SMTP), malicious URLs.
- Concept drift: Datasets with known distribution shifts (e.g., insect species over seasons).
- Recommendation: Movie ratings, restaurant reviews.
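To make the "known distribution shift" of the concept drift category concrete, here is a small synthetic stream whose labeling rule flips at a chosen position. The function name `abrupt_drift_stream`, the single feature `x`, and the two labeling rules are assumptions made for illustration; they do not correspond to any bundled dataset.

```python
import itertools
import random


def abrupt_drift_stream(drift_at: int, seed: int = 0):
    """Yield (features, target) pairs whose labeling rule flips at `drift_at`.

    Before the drift the label is `x > 0.5`; from sample index `drift_at`
    onward it is `x <= 0.5`. Both rules are illustrative only.
    """
    rng = random.Random(seed)
    for i in itertools.count():
        value = rng.random()
        if i < drift_at:
            label = value > 0.5
        else:
            label = value <= 0.5
        yield {"x": value}, label


# The stream is unbounded, so take a finite prefix for inspection.
samples = list(itertools.islice(abrupt_drift_stream(drift_at=3), 6))
```

Because the drift position is a parameter, a drift detector or adaptive model can be scored against the exact point where the change occurs, which is rarely possible with real-world data.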
Usage
Use built-in benchmark datasets when:
- You need a quick, reproducible way to test a new algorithm.
- You want to compare your model against published baselines.
- You are writing tutorials or documentation and need example data.
- You need datasets with specific properties (e.g., concept drift, class imbalance, high dimensionality).
Theoretical Basis
Streaming data abstraction: A dataset is modeled as a (possibly infinite) sequence of tuples:
D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
where x_i is the feature dictionary and y_i is the target of the i-th observation. The dataset exposes an iterator interface:
    for x, y in dataset:
        y_pred = model.predict_one(x)  # predict before the label is seen
        model.learn_one(x, y)          # then update the model with the true label
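The loop above leaves `dataset` and `model` undefined. A self-contained version might look like the sketch below, which assumes a toy in-memory stream and a hypothetical `MajorityClassifier` baseline; only the `predict_one` and `learn_one` method names come from the loop itself, and the data values are made up.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical baseline: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        # No prediction is possible before the first label arrives.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x, y):
        self.counts[y] += 1


def progressive_accuracy(model, dataset):
    """Test-then-train: score each prediction before the model sees the label."""
    correct = total = 0
    for x, y in dataset:
        y_pred = model.predict_one(x)
        correct += int(y_pred == y)
        total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0


# Toy stream of (features, target) pairs; the values are invented.
toy_stream = [
    ({"length": 35, "has_link": 0}, "ham"),
    ({"length": 40, "has_link": 0}, "ham"),
    ({"length": 120, "has_link": 1}, "spam"),
    ({"length": 22, "has_link": 0}, "ham"),
]

accuracy = progressive_accuracy(MajorityClassifier(), toy_stream)
```

Every sample is used for evaluation first and for training second, so no separate held-out split is needed and the score reflects performance on data the model has not yet seen.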
Dataset taxonomy:
    Dataset
    |-- SyntheticDataset (generated on-the-fly, potentially infinite)
    |-- FileDataset (stored on disk, finite)
        |-- LocalDataset (bundled with library)
        |-- RemoteDataset (downloaded and cached)
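The taxonomy can be sketched as a small class hierarchy. The code below is an illustrative reading of the tree, not the library's source: the class names follow the diagram, while the `take` helper, the constructor parameters, and the labeling rule of the synthetic stream are assumptions. The local and remote variants are omitted for brevity.

```python
import csv
import itertools
import random
from pathlib import Path
from typing import Iterator, Optional, Tuple

Sample = Tuple[dict, object]


class Dataset:
    """Base class: anything that can be iterated as (features, target) pairs."""

    n_samples: Optional[int] = None  # None means unbounded or unknown

    def __iter__(self) -> Iterator[Sample]:
        raise NotImplementedError

    def take(self, k: int):
        """Return the first k samples; this also works on infinite streams."""
        return list(itertools.islice(self, k))


class SyntheticDataset(Dataset):
    """Generated on the fly from a seeded RNG, so every pass is identical."""

    def __init__(self, seed: int = 0):
        self.seed = seed

    def __iter__(self):
        rng = random.Random(self.seed)
        while True:  # potentially infinite
            x = {"a": rng.random(), "b": rng.random()}
            yield x, x["a"] > x["b"]


class FileDataset(Dataset):
    """Finite dataset read row by row from a CSV file on disk."""

    def __init__(self, path: Path, target: str):
        self.path = Path(path)
        self.target = target

    def __iter__(self):
        with self.path.open(newline="") as f:
            for row in csv.DictReader(f):
                # Values stay as strings; type conversion is left to the caller.
                y = row.pop(self.target)
                yield row, y
```

Re-seeding inside `__iter__` rather than in the constructor means that iterating the same synthetic dataset twice yields the same samples, which keeps benchmark runs reproducible.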
Standardized metadata: Each dataset provides:
- n_samples: int or None (None for infinite/streaming)
- n_features: int
- n_classes: int (for classification)
- task: {classification, regression, anomaly_detection, ...}
- sparse: bool
Statistical properties for benchmarking: Well-chosen benchmark suites cover a range of difficulty dimensions: class imbalance ratios, feature dimensionality, noise levels, concept drift frequency, and dataset size. This enables researchers to characterize algorithm performance across varied conditions rather than on a single favorable scenario.
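Some of these difficulty dimensions can be measured directly from a finite stream in a single pass. The sketch below assumes a hypothetical helper `profile_stream` and uses one common definition of the imbalance ratio (majority class count divided by minority class count); other definitions exist.

```python
from collections import Counter


def profile_stream(stream):
    """Summarize a finite (features, target) stream in a single pass."""
    class_counts = Counter()
    feature_names = set()
    n_samples = 0
    for x, y in stream:
        n_samples += 1
        class_counts[y] += 1
        feature_names.update(x)  # adds the keys of the feature dictionary
    if class_counts:
        imbalance = max(class_counts.values()) / min(class_counts.values())
    else:
        imbalance = None  # undefined for an empty stream
    return {
        "n_samples": n_samples,
        "n_features": len(feature_names),
        "n_classes": len(class_counts),
        "imbalance_ratio": imbalance,
    }
```

Running such a profile over each dataset in a suite gives a quick view of which conditions the suite covers and which it leaves out.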
Related Pages
- Implementation:Online_ml_River_Datasets_Base
- Implementation:Online_ml_River_Datasets_Index
- Implementation:Online_ml_River_Datasets_Bananas
- Implementation:Online_ml_River_Datasets_Bikes
- Implementation:Online_ml_River_Datasets_ChickWeights
- Implementation:Online_ml_River_Datasets_HTTP
- Implementation:Online_ml_River_Datasets_Higgs
- Implementation:Online_ml_River_Datasets_ImageSegments
- Implementation:Online_ml_River_Datasets_Insects
- Implementation:Online_ml_River_Datasets_Keystroke
- Implementation:Online_ml_River_Datasets_MaliciousURL
- Implementation:Online_ml_River_Datasets_MovieLens100K
- Implementation:Online_ml_River_Datasets_Music
- Implementation:Online_ml_River_Datasets_Restaurants
- Implementation:Online_ml_River_Datasets_SMSSpam
- Implementation:Online_ml_River_Datasets_SMTP
- Implementation:Online_ml_River_Datasets_SolarFlare
- Implementation:Online_ml_River_Datasets_TREC07
- Implementation:Online_ml_River_Datasets_Taxis
- Implementation:Online_ml_River_Datasets_TrumpApproval
- Implementation:Online_ml_River_Datasets_WebTraffic
- Principle:Online_ml_River_Streaming_Data_Loading
- Principle:Online_ml_River_Progressive_Validation