Principle: Online ML - River Streaming Data Loading
| Knowledge Sources | River Docs |
|---|---|
| Domains | Online_Learning Data_Ingestion Classification |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Streaming data loading is a technique that delivers data as a continuous stream of individual observations rather than loading an entire dataset into memory at once.
Description
In online machine learning, models learn incrementally from one observation at a time. This fundamental constraint requires data to be delivered sequentially as a stream of individual samples, rather than as a monolithic in-memory structure like a NumPy array or a Pandas DataFrame. Streaming data loading addresses this by providing iterator-based access to datasets, where each call to the iterator yields a single (x, y) tuple consisting of a feature dictionary and a target value.
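As a minimal illustration of this access pattern (using only the standard library, not River's API), a stream can be any generator that yields (feature-dictionary, target) pairs one at a time:

```python
def toy_stream():
    """Yield (x, y) pairs one at a time: x is a feature dict, y a target."""
    rows = [
        ({"f1": 0.2, "f2": 1.0}, True),
        ({"f1": 0.7, "f2": 0.3}, False),
    ]
    for x, y in rows:
        yield x, y

first_x, first_y = next(iter(toy_stream()))
print(first_x, first_y)  # {'f1': 0.2, 'f2': 1.0} True
```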
River's built-in dataset classes (such as datasets.Phishing, datasets.Bananas, and synthetic generators) implement the __iter__ protocol, which means they can be used directly in for loops. These datasets serve as standardized streaming benchmarks that allow practitioners to evaluate and compare online learning algorithms under consistent conditions. Each dataset is self-contained, bundling metadata (number of samples, number of features, task type) alongside the data itself.
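The self-contained-dataset pattern can be sketched with a hypothetical class (the names and metadata attributes here are illustrative, not River's actual implementation): implementing __iter__ is what makes the dataset usable directly in a for loop while also carrying its own metadata.

```python
class ToyDataset:
    """A self-contained streaming dataset bundling metadata with its data."""
    n_samples = 3
    n_features = 2
    task = "Binary classification"

    _rows = [
        ({"x1": 1.0, "x2": 0.0}, True),
        ({"x1": 0.5, "x2": 0.5}, False),
        ({"x1": 0.0, "x2": 1.0}, True),
    ]

    def __iter__(self):
        # Yield one (features, target) pair at a time.
        for x, y in self._rows:
            yield x, y

dataset = ToyDataset()
print(dataset.n_samples)        # 3
print(sum(1 for _ in dataset))  # 3 -- usable directly in a for loop
```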
The key advantage of streaming data loading is its constant memory footprint: regardless of the total dataset size, only one observation resides in memory at any time. This makes it feasible to process datasets that are far larger than available RAM, and it mirrors real-world deployment scenarios where data arrives continuously from sensors, user interactions, or network events.
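The constant-memory property can be demonstrated with a lazy generator that could, in principle, produce far more samples than would ever fit in RAM; only one observation is materialized per step (a standard-library sketch, not River code):

```python
import itertools
import random

def synthetic_stream(n):
    """Lazily generate n observations; only one exists in memory at a time."""
    rng = random.Random(42)
    for _ in range(n):
        x = {"value": rng.random()}
        y = x["value"] > 0.5
        yield x, y

# Ten billion samples would never fit in RAM as a list, but because the
# stream is lazy, peeking at the first three costs almost nothing.
first_three = list(itertools.islice(synthetic_stream(10_000_000_000), 3))
print(len(first_three))  # 3
```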
Usage
Use streaming data loading when:
- You are training or evaluating an online learning model that processes data one sample at a time.
- You need to benchmark models against standardized datasets in a reproducible manner.
- The dataset is too large to fit in memory, or you want to simulate a production streaming scenario.
- You want to use River's evaluate.progressive_val_score or evaluate.iter_progressive_val_score, both of which expect iterable dataset inputs.
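The predict-then-learn loop that progressive validation automates can be sketched without River, using a deliberately trivial majority-class model (the model class here is a hypothetical stand-in, assumed only to expose predict_one and learn_one):

```python
class MajorityClass:
    """Trivial online classifier: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}

    def predict_one(self, x):
        if not self.counts:
            return None  # no labels observed yet
        return max(self.counts, key=self.counts.get)

    def learn_one(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

stream = [({"f": 0.1}, True), ({"f": 0.9}, True), ({"f": 0.4}, False)]
model = MajorityClass()
correct = 0
for x, y in stream:
    y_pred = model.predict_one(x)  # test-then-train: predict first...
    correct += int(y_pred == y)
    model.learn_one(x, y)          # ...then update on the true label
print(correct, "/", len(stream))   # 1 / 3
```

Because each sample is predicted before the model sees its label, the resulting score is an unbiased running estimate of out-of-sample performance.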
Theoretical Basis
Online learning algorithms are designed around the following protocol for each time step t:
- Receive observation x_t
- Predict target ŷ_t
- Receive true target y_t
- Update model parameters using the pair (x_t, y_t)
This protocol inherently requires data to arrive one observation at a time. Streaming data loading formalizes this by exposing datasets as Python iterators that yield (x, y) tuples:
for x, y in dataset:
    y_pred = model.predict_one(x)
    model.learn_one(x, y)
River's FileDataset base class wraps on-disk CSV files (optionally compressed) and produces an iterator via stream.iter_csv. The iteration is lazy: rows are read and converted one at a time, ensuring that memory usage is independent of the total number of samples.
Pseudocode for streaming data loading:
function stream_dataset(file_path, target_column):
    reader = open_csv(file_path)
    for row in reader:
        x = {col: convert(val) for (col, val) in row if col != target_column}
        y = convert(row[target_column])
        yield (x, y)
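A runnable Python version of this pseudocode can be written with the standard library's csv module. This is a simplified sketch of the idea behind lazy CSV streaming, not River's stream.iter_csv itself; the float and boolean converters are naive assumptions for illustration:

```python
import csv
import os
import tempfile

def stream_dataset(file_path, target_column):
    """Lazily yield (features, target) pairs from a CSV file, one row at a time."""
    with open(file_path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            x = {col: float(val) for col, val in row.items()
                 if col != target_column}
            y = row[target_column] == "1"  # naive boolean conversion
            yield x, y

# Usage: write a tiny CSV, then stream it one observation at a time.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="") as f:
    f.write("f1,f2,label\n0.1,0.2,1\n0.3,0.4,0\n")
    path = f.name
for x, y in stream_dataset(path, "label"):
    print(x, y)
os.remove(path)
```

Because the generator holds only the current row, streaming a multi-gigabyte CSV this way uses the same memory as streaming a ten-row one.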