Implementation:Online ml River Stream Iter Csv

From Leeroopedia


Knowledge Sources: River, River Docs
Domains: Online_Learning, Data_Ingestion, ETL
Last Updated: 2026-02-08 16:00 GMT

Overview

A concrete tool that converts CSV files into observation-by-observation data streams, with configurable type conversion, date parsing, column dropping, and random sampling.

Description

The stream.iter_csv function reads a CSV file (or buffer) and yields one (x, y) tuple at a time, where x is a feature dictionary and y is the target value (or None if no target column is specified). It supports on-the-fly type conversion via a converters dictionary, date parsing via parse_dates, column exclusion via drop, and random sub-sampling via fraction.
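The (x, y) contract described above can be illustrated with the standard library alone. The following is a minimal sketch, not River's implementation: iter_rows is a hypothetical stand-in, it only handles a single string target, and the order in which it applies drop, converters, and parse_dates is an assumption made for illustration.

```python
import csv
import datetime as dt
import io


def iter_rows(buffer, target=None, converters=None, parse_dates=None, drop=None):
    """Hypothetical stand-in that mimics the (x, y) contract; not River's code."""
    converters = converters or {}
    parse_dates = parse_dates or {}
    drop = set(drop or [])
    for row in csv.DictReader(buffer):
        # Exclude dropped columns from the feature dictionary.
        x = {name: value for name, value in row.items() if name not in drop}
        # Apply the per-column type conversions (CSV fields start out as strings).
        for name, convert in converters.items():
            x[name] = convert(x[name])
        # Parse date columns using their strptime format strings.
        for name, fmt in parse_dates.items():
            x[name] = dt.datetime.strptime(x[name], fmt)
        # Split the target off the features; y stays None without a target.
        y = x.pop(target) if target is not None else None
        yield x, y


data = io.StringIO("name,year,rating,notes\nPlanet Earth II,2016,9.5,n/a\n")
rows = list(
    iter_rows(
        data,
        target="rating",
        converters={"rating": float},
        parse_dates={"year": "%Y"},
        drop=["notes"],
    )
)
print(rows)
```

Because the function is a generator, rows are only read and converted as the caller consumes them, which is what makes the one-observation-at-a-time pattern possible.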

Under the hood, it uses a custom DictReader subclass that extends Python's csv.DictReader with Bernoulli sampling support. When compression="infer" (the default), compressed files are transparently decompressed based on their file extension (.gz or .zip).
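To make the idea of Bernoulli sampling in a csv.DictReader subclass concrete, here is a rough sketch. SamplingDictReader and its constructor arguments are hypothetical names chosen for this example; River's actual subclass may be structured differently. Each row is kept independently with probability fraction, and a seeded random.Random makes the selection reproducible.

```python
import csv
import io
import random


class SamplingDictReader(csv.DictReader):
    """Hypothetical sketch: keep each row independently with probability `fraction`."""

    def __init__(self, f, fraction=1.0, seed=None, **kwargs):
        super().__init__(f, **kwargs)
        self.fraction = fraction
        self.rng = random.Random(seed)

    def __next__(self):
        # Pull rows until one passes the coin flip. At end of file the parent
        # raises StopIteration, which ends the iteration as usual.
        while True:
            row = super().__next__()
            if self.fraction >= 1.0 or self.rng.random() < self.fraction:
                return row


text = "i\n" + "\n".join(str(i) for i in range(1000)) + "\n"

# The same seed selects the same rows on every pass over the same data.
sample_a = [row["i"] for row in SamplingDictReader(io.StringIO(text), fraction=0.1, seed=42)]
sample_b = [row["i"] for row in SamplingDictReader(io.StringIO(text), fraction=0.1, seed=42)]
print(len(sample_a), sample_a == sample_b)
```

Bernoulli sampling needs no knowledge of the file length, so it works on a single forward pass; the trade-off is that the number of rows kept is only approximately fraction times the total.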

This function is the core CSV ingestion primitive behind River's built-in CSV-backed dataset classes (such as datasets.Phishing). It is also available directly for loading custom CSV data.

Usage

Import this function when you need to:

  • Stream a custom CSV file into a River model for training or evaluation.
  • Control type conversion, date parsing, or column selection during ingestion.
  • Sub-sample a large CSV file for rapid prototyping.
  • Build a custom dataset class that wraps a CSV file.

Code Reference

Source Location

File: river/stream/iter_csv.py
Lines: L34-L189

Signature

def iter_csv(
    filepath_or_buffer,
    target: str | list[str] | None = None,
    converters: dict | None = None,
    parse_dates: dict | None = None,
    drop: list[str] | None = None,
    drop_nones=False,
    fraction=1.0,
    compression="infer",
    seed: int | None = None,
    field_size_limit: int | None = None,
    **kwargs,
) -> base.typing.Stream

Import

from river import stream

dataset = stream.iter_csv('data.csv', target='label')

I/O Contract

Inputs

Parameter | Type | Default | Description
filepath_or_buffer | str or buffer | (required) | Path to a CSV file or a buffer with a read method.
target | str | list[str] | None | None | Name of the target column. If a list, multiple output targets are extracted. If None, y is always None.
converters | dict | None | None | Mapping of column names to callables for type conversion (e.g., {'age': int, 'score': float}).
parse_dates | dict | None | None | Mapping of column names to datetime format strings for date parsing.
drop | list[str] | None | None | Column names to exclude from the feature dictionary.
drop_nones | bool | False | Whether to drop features with None values.
fraction | float | 1.0 | Sampling fraction in (0, 1]. Values below 1.0 enable Bernoulli sampling.
compression | str | "infer" | Decompression method. "infer" detects it from the file extension (.gz, .zip).
seed | int | None | None | Random seed for deterministic sampling.
field_size_limit | int | None | None | Maximum field size for the CSV reader.
**kwargs | | | Additional keyword arguments passed to csv.DictReader.

Types are written as in the signature, so a " | " inside the Type column (for example "int | None") is part of the type, not a column separator.
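The "infer" behaviour of the compression parameter can be pictured as a lookup on the file extension. The sketch below uses only the standard library; open_csv and its extension mapping are hypothetical and are not taken from River's source. It also assumes that a .zip archive contains a single CSV member.

```python
import gzip
import io
import os
import tempfile
import zipfile


def open_csv(path, compression="infer"):
    """Hypothetical helper: return a text handle, decompressing by extension."""
    if compression == "infer":
        _, ext = os.path.splitext(path)
        compression = {".gz": "gzip", ".zip": "zip"}.get(ext)
    if compression == "gzip":
        return gzip.open(path, "rt", newline="")
    if compression == "zip":
        archive = zipfile.ZipFile(path)
        # Assumption: the archive holds exactly one CSV member.
        member = archive.namelist()[0]
        return io.TextIOWrapper(archive.open(member), newline="")
    # No recognised compression: open as a plain text file.
    return open(path, newline="")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv.gz")
    with gzip.open(path, "wt", newline="") as f:
        f.write("a,b\n1,2\n")
    # The .gz extension triggers gzip decompression.
    with open_csv(path) as f:
        first_line = f.readline()

print(repr(first_line))
```

The handle returned by such a helper can be passed to csv.DictReader like any other text file, which keeps decompression separate from parsing.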

Outputs

Output | Type | Description
Return value | base.typing.Stream | A generator yielding (x: dict, y) tuples. x is a dictionary of feature names to values. y is the target value (type depends on converters) or None if no target is specified.

Usage Examples

Basic CSV streaming with target:

from river import stream

for x, y in stream.iter_csv('data.csv', target='label'):
    print(x, y)

With type converters and date parsing:

from river import stream

params = {
    'converters': {'rating': float},
    'parse_dates': {'year': '%Y'}
}

for x, y in stream.iter_csv('tv_shows.csv', target='rating', **params):
    print(x, y)
# {'name': 'Planet Earth II', 'year': datetime.datetime(2016, 1, 1, 0, 0)} 9.5
# ...

Sub-sampling a large file:

from river import stream

# Only read ~10% of the rows, deterministically
for x, y in stream.iter_csv('large_data.csv', target='label', fraction=0.1, seed=42):
    print(x, y)

Without a target column:

from river import stream

for x, y in stream.iter_csv('features_only.csv'):
    print(x, y)
    # y is always None
