Implementation: Online ML - River stream.iter_csv
| Knowledge Sources | River Docs |
|---|---|
| Domains | Online_Learning Data_Ingestion ETL |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Concrete tool for converting CSV files into observation-by-observation data streams with configurable type conversion, date parsing, column dropping, and random sampling.
Description
The stream.iter_csv function reads a CSV file (or buffer) and yields one (x, y) tuple at a time, where x is a feature dictionary and y is the target value (or None if no target column is specified). It supports on-the-fly type conversion via a converters dictionary, date parsing via parse_dates, column exclusion via drop, and random sub-sampling via fraction.
Under the hood, it uses a custom DictReader subclass that extends Python's csv.DictReader with Bernoulli sampling support. Compressed files (.gz, .zip) are transparently decompressed when compression="infer".
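The sampling behaviour described above can be sketched with the standard library alone. The class below is a hypothetical illustration of a Bernoulli-sampling csv.DictReader subclass, not River's actual internal implementation:

```python
import csv
import io
import random

class SampledDictReader(csv.DictReader):
    """Hypothetical sketch: a csv.DictReader that keeps each row
    with probability `fraction` (Bernoulli sampling)."""

    def __init__(self, f, fraction=1.0, rng=None, **kwargs):
        super().__init__(f, **kwargs)
        self.fraction = fraction
        self.rng = rng or random.Random()

    def __next__(self):
        row = super().__next__()
        # Discard rows until one passes the Bernoulli draw.
        while self.rng.random() > self.fraction:
            row = super().__next__()
        return row

data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")
reader = SampledDictReader(data, fraction=0.5, rng=random.Random(42))
rows = list(reader)  # roughly half of the four rows survive
```

Seeding the random generator, as River does via its `seed` parameter, makes the sampled subset reproducible across runs.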
This function is the core data ingestion primitive used by all of River's built-in FileDataset classes (such as datasets.Phishing). It is also available directly for loading custom CSV data.
Usage
Import this function when you need to:
- Stream a custom CSV file into a River model for training or evaluation.
- Control type conversion, date parsing, or column selection during ingestion.
- Sub-sample a large CSV file for rapid prototyping.
- Build a custom dataset class that wraps a CSV file.
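To make the last use case concrete, here is a minimal, hypothetical dataset class that mimics the (x, y) streaming contract of stream.iter_csv using only the standard library; the class name, the tiny CSV, and the simplified converter handling are illustrative assumptions, not River code:

```python
import csv
import os
import tempfile

class CSVDataset:
    """Hypothetical wrapper that streams (x, y) pairs from a CSV file,
    mimicking the contract of stream.iter_csv."""

    def __init__(self, path, target=None, converters=None):
        self.path = path
        self.target = target
        self.converters = converters or {}

    def __iter__(self):
        with open(self.path, newline="") as f:
            for row in csv.DictReader(f):
                # Apply per-column type conversion.
                for col, func in self.converters.items():
                    if col in row:
                        row[col] = func(row[col])
                # Pop the target out of the feature dict.
                y = row.pop(self.target, None) if self.target else None
                yield row, y

# Write a tiny CSV and stream it.
with tempfile.NamedTemporaryFile("w", suffix=".csv",
                                 delete=False, newline="") as tmp:
    tmp.write("age,label\n21,yes\n34,no\n")
    path = tmp.name

dataset = CSVDataset(path, target="label", converters={"age": int})
pairs = list(dataset)
os.remove(path)
# pairs == [({'age': 21}, 'yes'), ({'age': 34}, 'no')]
```

In River itself, a custom dataset class would typically delegate its `__iter__` to stream.iter_csv rather than reimplement the parsing.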
Code Reference
Source Location
| File | Lines |
|---|---|
| river/stream/iter_csv.py | L34-L189 |
Signature
def iter_csv(
filepath_or_buffer,
target: str | list[str] | None = None,
converters: dict | None = None,
parse_dates: dict | None = None,
drop: list[str] | None = None,
drop_nones=False,
fraction=1.0,
compression="infer",
seed: int | None = None,
field_size_limit: int | None = None,
**kwargs,
) -> base.typing.Stream
Import
from river import stream
dataset = stream.iter_csv('data.csv', target='label')
I/O Contract
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath_or_buffer | str or buffer | (required) | Path to a CSV file, or a buffer with a read method. |
| target | str \| list[str] \| None | None | Name of the target column. If a list, multiple target values are extracted. If None, y is always None. |
| converters | dict \| None | None | Mapping of column names to callables for type conversion (e.g., {'age': int, 'score': float}). |
| parse_dates | dict \| None | None | Mapping of column names to datetime format strings for date parsing. |
| drop | list[str] \| None | None | Column names to exclude from the feature dictionary. |
| drop_nones | bool | False | Whether to drop features with None values. |
| fraction | float | 1.0 | Sampling fraction in (0, 1]. Values below 1.0 enable Bernoulli sampling. |
| compression | str | "infer" | Decompression method. "infer" detects from the file extension (.gz, .zip). |
| seed | int \| None | None | Random seed for deterministic sampling. |
| field_size_limit | int \| None | None | Maximum field size for the CSV reader. |
| **kwargs | | | Additional keyword arguments passed to csv.DictReader. |
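Because the extra keyword arguments are forwarded to csv.DictReader, standard dialect options such as delimiter or quotechar pass straight through. The stdlib snippet below illustrates what that forwarding amounts to; the data is invented for illustration:

```python
import csv
import io

# Semicolon-separated data, as produced by some locales' spreadsheet exports.
buffer = io.StringIO("name;score\nada;9.5\ngrace;8.0\n")

# stream.iter_csv(..., delimiter=';') would forward delimiter=';' to
# csv.DictReader, which then parses the rows like this:
rows = list(csv.DictReader(buffer, delimiter=";"))
# rows == [{'name': 'ada', 'score': '9.5'}, {'name': 'grace', 'score': '8.0'}]
```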
Outputs
| Output | Type | Description |
|---|---|---|
| Return value | base.typing.Stream | A generator yielding (x, y) tuples, where x is a dict mapping feature names to values and y is the target value (its type depends on converters), or None if no target is specified. |
Usage Examples
Basic CSV streaming with target:
from river import stream
for x, y in stream.iter_csv('data.csv', target='label'):
print(x, y)
With type converters and date parsing:
from river import stream
params = {
'converters': {'rating': float},
'parse_dates': {'year': '%Y'}
}
for x, y in stream.iter_csv('tv_shows.csv', target='rating', **params):
print(x, y)
# {'name': 'Planet Earth II', 'year': datetime.datetime(2016, 1, 1, 0, 0)} 9.5
# ...
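Each parse_dates entry maps a column to a strptime format string; internally, every value in that column is parsed roughly as the standard library does below (unspecified components default to their minimum, which is why '2016' becomes January 1st):

```python
from datetime import datetime

# The '%Y' format from the example above, applied to one raw cell value.
parsed = datetime.strptime("2016", "%Y")
# parsed == datetime(2016, 1, 1, 0, 0)
```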
Sub-sampling a large file:
from river import stream
# Only read ~10% of the rows, deterministically
for x, y in stream.iter_csv('large_data.csv', target='label', fraction=0.1, seed=42):
print(x, y)
Without a target column:
from river import stream
for x, y in stream.iter_csv('features_only.csv'):
print(x, y)
# y is always None