Principle:Huggingface Datasets Stream Iteration
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Iterating over a configured streaming dataset yields formatted examples by executing the full lazy pipeline of transformations on demand.
Description
Stream iteration is the execution point of the entire streaming pipeline. All preceding operations (loading, mapping, filtering, shuffling, taking, skipping, and format configuration) only define a lazy pipeline; none of them fetches examples or executes a transformation. Only when the consumer invokes __iter__ (typically through a for loop, or through iter() followed by next()) does the pipeline begin to materialize elements, one at a time.
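The difference between defining and executing a pipeline can be seen in a small pure-Python model. This is only a sketch: LazyStream and its log attribute are hypothetical names invented for illustration and are not part of the datasets library.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


class LazyStream:
    """Toy stand-in for a streaming dataset (hypothetical, not library code).

    map() and filter() only record a step; the source is read in __iter__.
    """

    def __init__(
        self,
        source: Iterable[int],
        steps: Optional[List[Tuple[str, Callable]]] = None,
        log: Optional[List[str]] = None,
    ) -> None:
        self._source = source
        self._steps = list(steps or [])
        self.log = log if log is not None else []  # shared record of source reads

    def map(self, fn: Callable[[int], int]) -> "LazyStream":
        # Definition only: nothing is read from the source here.
        return LazyStream(self._source, self._steps + [("map", fn)], self.log)

    def filter(self, pred: Callable[[int], bool]) -> "LazyStream":
        return LazyStream(self._source, self._steps + [("filter", pred)], self.log)

    def __iter__(self) -> Iterator[int]:
        for item in self._source:
            self.log.append(f"read {item}")
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item


stream = LazyStream(range(5)).map(lambda x: x * 10).filter(lambda x: x != 20)
log_before_iteration = list(stream.log)  # nothing has been read yet
first = next(iter(stream))               # pulls exactly one element
log_after_first = list(stream.log)
```

After the pipeline is defined the log is still empty; the first source read happens only when next() is called on the iterator.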
The iteration process involves several stages:
- PyTorch worker detection: If the dataset is being consumed inside a PyTorch DataLoader worker process, iteration delegates to _iter_pytorch, which handles data sharding across workers.
- Pipeline preparation: The internal _prepare_ex_iterable_for_iteration method finalizes the iterable chain, applying any pending distributed or epoch-based configuration.
- Arrow-based fast path: If a format is set and the underlying iterable supports Arrow iteration (iter_arrow), the pipeline uses Arrow tables for efficient batch processing. A format-specific Formatter converts each Arrow row to the target type (e.g., PyTorch tensor).
- Standard path: If no Arrow fast path is available, examples are yielded as Python dictionaries from a FormattedExamplesIterable that handles feature decoding and formatting.
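The order of these stages can be modeled in a few lines. The following is a toy sketch under loud assumptions: ToyStream, its constructor parameters, and path_taken are hypothetical names, the fixed batch size of 2 is arbitrary, and the real internal methods do considerably more than this model suggests.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Row = Dict[str, Any]


class ToyStream:
    """Hypothetical model of the stage order; not the library's implementation."""

    def __init__(
        self,
        rows: List[Row],
        formatter: Optional[Callable[[Row], Any]] = None,
        supports_batches: bool = False,
        in_worker: bool = False,
    ) -> None:
        self.rows = rows
        self.formatter = formatter
        self.supports_batches = supports_batches
        self.in_worker = in_worker
        self.path_taken = ""  # records which branch ran, for illustration only

    def _prepare(self) -> List[Row]:
        # Stand-in for finalizing the iterable chain before iteration.
        return list(self.rows)

    def _iter_in_worker(self) -> Iterator[Row]:
        # Stand-in for the worker-aware path; sharding is omitted in this toy.
        yield from self._prepare()

    def __iter__(self) -> Iterator[Any]:
        # Stage 1: inside a worker process, delegate to the worker-aware path.
        if self.in_worker:
            self.path_taken = "worker"
            yield from self._iter_in_worker()
            return
        # Stage 2: pipeline preparation.
        rows = self._prepare()
        # Stage 3: fast path when a format is set and batches are available.
        if self.formatter is not None and self.supports_batches:
            self.path_taken = "fast"
            for start in range(0, len(rows), 2):
                batch = rows[start:start + 2]
                for row in batch:
                    yield self.formatter(row)
            return
        # Stage 4: standard path yields plain dictionaries.
        self.path_taken = "standard"
        yield from rows


rows = [{"x": i} for i in range(4)]
plain = list(ToyStream(rows))
formatted = list(ToyStream(rows, formatter=lambda r: (r["x"],), supports_batches=True))
```

The point of the sketch is only the decision order: worker check first, then preparation, then a choice between the batch-oriented fast path and the per-example standard path.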
The iteration protocol ensures that each example passes through the complete transformation chain exactly once, in order, with no buffering beyond what is required by intermediate operations (e.g., the shuffle buffer).
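To make the buffering remark concrete, here is a minimal sketch of a buffer-based streaming shuffle. It is a generic version of the technique, written on the assumption that the shuffle stage works along these lines; buffered_shuffle and its parameters are hypothetical names, and the library's actual implementation may differ in detail.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(source: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximate shuffle that holds at most buffer_size elements in memory."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in source:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill phase: nothing is yielded yet
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a randomly chosen buffered element
        buffer[idx] = item       # and put the incoming element in its slot
    rng.shuffle(buffer)
    yield from buffer            # drain what is left once the source ends


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
```

Every input element is emitted exactly once, and the first output appears only after the buffer has filled, which is the one place where this stage holds more than a single element.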
Usage
Use stream iteration when:
- You are consuming streaming dataset elements in a training loop, evaluation loop, or data processing pipeline.
- You are using a for loop, next(), or passing the dataset to a PyTorch DataLoader.
- You want to trigger the execution of a lazily-defined pipeline of map, filter, shuffle, and format operations.
Theoretical Basis
Stream iteration implements the pull-based evaluation model. In contrast to push-based systems (where the producer drives the flow), the consumer explicitly requests each element by calling __next__. This model is embodied by Python's iterator protocol (__iter__ / __next__) and is the foundation of Python generators.
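The pull-based model is easiest to see with a hand-written iterator. In this sketch, Countdown and consume are illustrative names; consume spells out roughly what a for loop does with the iterator protocol.

```python
from typing import Iterable, List


class Countdown:
    """Minimal iterator: a value is produced only when __next__ is called."""

    def __init__(self, start: int) -> None:
        self.current = start
        self.pulls = 0  # how many times the consumer asked for an element

    def __iter__(self) -> "Countdown":
        return self

    def __next__(self) -> int:
        self.pulls += 1
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1


def consume(iterable: Iterable[int]) -> List[int]:
    """Roughly what a for loop does: iter() once, then next() until StopIteration."""
    iterator = iter(iterable)
    out: List[int] = []
    while True:
        try:
            out.append(next(iterator))
        except StopIteration:
            return out


source = Countdown(3)
first = next(source)              # the consumer pulls one element
pulls_after_first = source.pulls  # exactly one request so far
rest = consume(source)            # the consumer pulls the remainder
```

The producer never runs ahead of the consumer: the pulls counter advances only when next() is called, including the final call that raises StopIteration.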
The lazy pipeline follows the pipeline pattern (also called pipes and filters). Each stage in the pipeline (map, filter, shuffle, format) is a filter in the pattern's sense: a component that transforms or selects elements. Stages are connected by implicit channels (Python generator yield). The pipeline executes in a streaming fashion: apart from stages that must buffer (e.g., the shuffle buffer), each element flows through all stages before the next element enters the pipeline, which minimizes memory usage.
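A chain of plain Python generators shows this element-at-a-time flow. The stage functions and the trace list below are hypothetical helpers written for this illustration, not library APIs.

```python
from typing import Callable, Iterable, Iterator, List

trace: List[str] = []  # records the order in which stages touch elements


def source(n: int) -> Iterator[int]:
    for i in range(n):
        trace.append(f"source {i}")
        yield i


def map_stage(items: Iterable[int], fn: Callable[[int], int]) -> Iterator[int]:
    for item in items:
        trace.append(f"map {item}")
        yield fn(item)


def filter_stage(items: Iterable[int], pred: Callable[[int], bool]) -> Iterator[int]:
    for item in items:
        trace.append(f"filter {item}")
        if pred(item):
            yield item


# Compose the stages; no stage runs until the result is iterated.
pipeline = filter_stage(map_stage(source(3), lambda x: x + 1), lambda x: x % 2 == 1)
result = list(pipeline)
```

The trace interleaves the stages per element (source, map, filter, then the next source read) rather than running each stage over the whole input, so no intermediate collection is ever built.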
Compatibility with torch.utils.data.IterableDataset lets the dataset participate in PyTorch's multi-process data loading architecture, where each worker process iterates over its own shard of the data independently.
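The per-worker split can be sketched without PyTorch. The example assumes a simple round-robin assignment of shards to workers; that rule, the function names, and the shard names are assumptions for illustration, and the library's actual assignment may differ. In a real DataLoader worker, the worker id and worker count would come from torch.utils.data.get_worker_info().

```python
from typing import Dict, Iterator, List


def shards_for_worker(shards: List[str], worker_id: int, num_workers: int) -> List[str]:
    """Assumed round-robin rule: worker k takes shards k, k + n, k + 2n, ..."""
    return shards[worker_id::num_workers]


def iterate_worker(
    data: Dict[str, List[int]], worker_id: int, num_workers: int
) -> Iterator[int]:
    """Yield only the examples stored in the shards assigned to this worker."""
    for shard in shards_for_worker(sorted(data), worker_id, num_workers):
        yield from data[shard]


# Three hypothetical shards split across two workers.
data = {"shard-0": [0, 1], "shard-1": [2, 3], "shard-2": [4, 5]}
per_worker = [list(iterate_worker(data, w, 2)) for w in range(2)]
```

Each worker sees a disjoint subset of the shards, so together the workers cover the data once; with fewer shards than workers, some workers would receive nothing under this rule.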