Principle:Online ml River DataFrame Stream Ingestion
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| River, River Docs | Online Machine Learning, Data Streaming, Pandas Integration | 2026-02-08 16:00 GMT |
Overview
DataFrame Stream Ingestion is the technique for converting pandas DataFrames into observation-by-observation data streams compatible with River's online learning API.
Description
In online machine learning, models learn from data one observation at a time rather than in batch. However, datasets are frequently stored in tabular formats such as pandas DataFrames. DataFrame Stream Ingestion bridges this gap by providing a systematic method to iterate over the rows of a DataFrame, yielding each row as a Python dictionary alongside an optional target value. This conversion is essential because River's learn_one / predict_one API, from clustering to classification to regression, operates on individual observations represented as feature dictionaries.
The ingestion function, stream.iter_pandas, iterates row-by-row through the DataFrame and converts each row into a dict mapping feature names to values. When a target is provided as a separate pd.Series (single target) or pd.DataFrame (multiple target columns), the function yields (x, y) tuples where x is the feature dictionary and y is the corresponding target value; with a pd.DataFrame target, y is itself a dict keyed by target column name. When no target is supplied, y is None.
Internally, the DataFrame is converted to a NumPy array via to_numpy(), and the iteration is delegated to stream.iter_array, which handles the actual row-by-row dictionary construction using the original column names as keys.
Usage
Use DataFrame Stream Ingestion whenever you have a pandas DataFrame and need to feed it into any River model's learn_one / predict_one loop. This is especially common during:
- Prototyping: When testing an online learning algorithm on a small dataset already loaded into a DataFrame.
- Benchmarking: When comparing River's online models against batch implementations using the same DataFrame.
- Unsupervised streaming: When feeding data into a clustering algorithm such as cluster.KMeans where no target column is needed (pass y=None).
- Supervised streaming: When the DataFrame contains both features and a target column that must be separated before streaming.
Theoretical Basis
The theoretical basis for DataFrame Stream Ingestion lies in the distinction between the batch learning paradigm and the online (incremental) learning paradigm:
Batch learning assumes access to the entire dataset at once. Models are typically trained by making one or more passes over the full dataset. Data structures like DataFrames and matrices naturally support this pattern.
Online learning processes data one observation at a time in a single pass. Each observation is a dictionary {feature_name: value}, and the model updates its internal state immediately upon seeing it.
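To make the contrast concrete, here is a small library-free sketch that computes a mean both ways. RunningMean is a hypothetical class written for this illustration (its learn_one method only mirrors River's naming); it is not part of River.

```python
# Hypothetical observations, each a {feature_name: value} dictionary.
observations = [{"x": 2.0}, {"x": 4.0}, {"x": 9.0}]

# Batch: the whole dataset must be available before computing anything.
batch_mean = sum(obs["x"] for obs in observations) / len(observations)


class RunningMean:
    """Illustrative online estimator: updates its state one observation at a time."""

    def __init__(self, feature: str) -> None:
        self.feature = feature
        self.n = 0
        self.mean = 0.0

    def learn_one(self, x: dict) -> None:
        # Incremental mean update; no past observations are stored.
        self.n += 1
        self.mean += (x[self.feature] - self.mean) / self.n


online = RunningMean("x")
for obs in observations:
    online.learn_one(obs)

print(batch_mean, online.mean)
```

Both approaches arrive at the same value, but the online version has a usable estimate after every observation and never holds more than one row in memory.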
The conversion follows this straightforward procedure:
PROCEDURE iter_pandas(X: DataFrame, y: Series or None):
    feature_names = X.columns
    FOR i = 0 TO len(X) - 1:
        x_dict = {feature_names[j]: X[i, j] for j in 0..num_features-1}
        y_val = y[i] if y is not None else None
        YIELD (x_dict, y_val)
This row-by-row iteration transforms the columnar storage of a DataFrame into a sequential stream of individual observations, which is the fundamental data interface for all River estimators.
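The procedure above can be sketched in runnable Python using pandas alone. The function name iter_rows is hypothetical and the sketch covers only the single-target case; it is meant to mirror the pseudocode, not to reproduce River's implementation.

```python
from typing import Any, Dict, Iterator, Optional, Tuple

import pandas as pd


def iter_rows(
    X: pd.DataFrame, y: Optional[pd.Series] = None
) -> Iterator[Tuple[Dict[str, Any], Any]]:
    """Yield (feature_dict, target) pairs row by row (illustrative sketch)."""
    feature_names = list(X.columns)
    values = X.to_numpy()
    # Positional access avoids surprises when y has a non-default index.
    targets = None if y is None else y.to_numpy()

    for i in range(len(values)):
        x_dict = dict(zip(feature_names, values[i]))
        y_val = None if targets is None else targets[i]
        yield x_dict, y_val


# Hypothetical toy data.
X = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
y = pd.Series([10, 20])

with_target = list(iter_rows(X, y))
without_target = list(iter_rows(X))
```

Converting y with to_numpy() makes the lookup positional, which sidesteps the label-based indexing that y[i] would perform on a Series whose index is not 0..n-1.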