Implementation:Huggingface Datasets IterableDataset iter
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for iterating over a configured streaming dataset and yielding formatted examples.
Description
IterableDataset.__iter__ is the method that executes the streaming pipeline and yields individual examples to the consumer. It implements Python's iterator protocol, making the dataset usable in for loops, with next(), and as an input to PyTorch DataLoader.
The method follows this execution flow:
- PyTorch worker check: If torch is imported and the code is running inside a torch.utils.data.DataLoader worker, it delegates to self._iter_pytorch(), which handles per-worker data sharding, and returns early.
- Prepare the iterable: Calls self._prepare_ex_iterable_for_iteration() to finalize distributed and epoch settings on the internal _ex_iterable.
- Formatted Arrow path: If a formatting config is set and the iterable supports iter_arrow (or the format is table-based), it creates a format-specific Formatter via get_formatter(), iterates over Arrow tables, and yields each row through formatter.format_row(pa_table).
- Standard path: If no Arrow fast path applies, it iterates over (key, example) pairs from the ex_iterable and yields each example dictionary directly. Feature decoding and formatting have already been applied by the FormattedExamplesIterable wrapper.
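The dispatch order described above can be illustrated with a small pure-Python sketch. This is not the library's implementation: ToyStreamingDataset, its in_worker flag, its format_row callable, and the helper methods are all hypothetical stand-ins, and the worker branch does no real sharding. It only shows how a generator-based __iter__ can take an early-return branch, prepare its source, and then pick between a formatted and a standard path.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Tuple

Example = Dict[str, Any]


class ToyStreamingDataset:
    """Hypothetical stand-in for a streaming dataset (not the real IterableDataset)."""

    def __init__(
        self,
        ex_iterable: Iterable[Tuple[int, Example]],
        format_row: Optional[Callable[[Example], Any]] = None,
        in_worker: bool = False,
    ) -> None:
        self._ex_iterable = ex_iterable
        self._format_row = format_row  # assumed stand-in for a formatter
        self._in_worker = in_worker    # assumed stand-in for the DataLoader worker check

    def _iter_worker(self) -> Iterator[Example]:
        # Placeholder for per-worker sharding: this toy version yields everything.
        for _key, example in self._ex_iterable:
            yield example

    def _prepare(self) -> Iterable[Tuple[int, Example]]:
        # Placeholder for finalizing distributed and epoch settings.
        return self._ex_iterable

    def __iter__(self) -> Iterator[Any]:
        # 1. Worker check: delegate and return early.
        if self._in_worker:
            yield from self._iter_worker()
            return
        # 2. Prepare the underlying iterable.
        ex_iterable = self._prepare()
        # 3. Formatted path: pass each example through the formatter.
        if self._format_row is not None:
            for _key, example in ex_iterable:
                yield self._format_row(example)
            return
        # 4. Standard path: yield the example dictionaries directly.
        for _key, example in ex_iterable:
            yield example


rows = [(0, {"text": "a"}), (1, {"text": "b"})]
plain = list(ToyStreamingDataset(rows))
formatted = list(ToyStreamingDataset(rows, format_row=lambda ex: ex["text"].upper()))
```

Because __iter__ is written as a generator function, each call to iter() on the toy dataset starts a fresh pass over the source, which is the behavior a for loop relies on.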
Usage
Use IterableDataset.__iter__ implicitly via a for loop or next(iter(ds)) to consume elements from any streaming dataset.
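Since consumption goes through the iterator protocol, standard-library tools that accept any iterable can be used to read a bounded number of elements from a stream. The sketch below uses a hypothetical generator, fake_stream, in place of a real streaming dataset so that it runs without network access; take_first is likewise an illustrative helper, not part of the Datasets API.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for a streaming dataset: yields examples lazily."""
    for i in range(1_000_000):
        yield {"id": i, "text": f"example {i}"}


def take_first(stream: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Consume at most n examples; islice stops pulling from the stream after n."""
    return list(islice(stream, n))


# Single element, equivalent to next(iter(ds)) on a streaming dataset.
first = next(iter(fake_stream()))

# Bounded read: only three examples are ever generated.
head = take_first(fake_stream(), 3)
```

The same take_first call should work on any object that implements __iter__, which is the only assumption it makes about its input.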
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/iterable_dataset.py
- Lines: L2545-L2568
Signature
def __iter__(self):
Import
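from datasets import load_dataset

# "my_dataset" is a placeholder for a dataset name or path
ds = load_dataset("my_dataset", split="train", streaming=True)
# __iter__ is called implicitly by for loops
for example in ds:
    process(example)  # process() stands in for user-defined handling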
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| self | IterableDataset | Yes | The streaming dataset instance with its configured pipeline of lazy transforms and optional formatting. |
Outputs
| Name | Type | Description |
|---|---|---|
| example | dict (or formatted type) | Each yielded element is a dictionary mapping column names to values. If with_format was called, values are converted to the specified type (e.g., torch.Tensor, np.ndarray). |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)
# Iterate with a for loop
for example in ds:
    print(example["text"])
    break  # just print the first one
# Or use next() for a single element
it = iter(ds)
first = next(it)
print(first)
# {'label': 1, 'text': 'the rock is destined to be the 21st century\'s new "conan" ...'}