Implementation:Huggingface Datasets IterableDataset iter

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for iterating over a configured streaming dataset and yielding formatted examples.

Description

IterableDataset.__iter__ is the method that executes the streaming pipeline and yields individual examples to the consumer. It implements the iterable side of Python's iterator protocol, making the dataset usable in for loops, with next(iter(ds)), and as an input to PyTorch DataLoader.
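
As a minimal sketch of what that protocol implies for the consumer, the toy class below defines only __iter__, written as a generator. TinyStream is a hypothetical stand-in, not part of the datasets library; it only assumes that a streaming dataset behaves like any Python object that defines __iter__ but not __next__.

```python
class TinyStream:
    """Hypothetical stand-in for a streaming dataset; not part of the datasets library."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # A generator-based __iter__ returns a fresh iterator on every call.
        for i in range(self.n):
            yield {"id": i}


stream = TinyStream(3)

# A for loop (or comprehension) calls iter(stream) implicitly.
ids = [example["id"] for example in stream]

# next() needs an iterator, so wrap the object with iter() first.
first = next(iter(stream))

# Calling next() on the object itself fails because it defines no __next__.
try:
    next(stream)
    direct_next_works = True
except TypeError:
    direct_next_works = False
```

Because each call to iter() starts a new pass, next(iter(stream)) always returns the first example; to walk through the data step by step, keep a reference to one iterator and call next() on it repeatedly.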

The method follows this execution flow:

  1. PyTorch worker check: If torch is imported and the code is running inside a torch.utils.data.DataLoader worker, it delegates to self._iter_pytorch(), which handles per-worker data sharding, and then returns early.
  2. Prepare the iterable: Calls self._prepare_ex_iterable_for_iteration() to finalize distributed and epoch settings on the internal _ex_iterable.
  3. Formatted Arrow path: If a formatting config is set and the iterable supports iter_arrow (or the format is table-based), it creates a format-specific Formatter via get_formatter(), iterates over Arrow tables, and yields each row through formatter.format_row(pa_table).
  4. Standard path: If no Arrow fast path applies, iterates over (key, example) pairs from the ex_iterable and yields each example dictionary directly. Feature decoding and formatting have already been applied by the FormattedExamplesIterable wrapper.
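
The four steps above can be sketched as a toy class. This is an illustration of the dispatch order only, not the library's code. SketchDataset, in_worker, _iter_worker, and _prepare are hypothetical stand-ins for the DataLoader worker check, self._iter_pytorch(), and self._prepare_ex_iterable_for_iteration(). The formatter here is a plain callable rather than an object returned by get_formatter().

```python
class SketchDataset:
    """Toy model of the dispatch order; all names here are hypothetical."""

    def __init__(self, rows, formatter=None, in_worker=False):
        self.rows = rows            # list of dicts, standing in for _ex_iterable
        self.formatter = formatter  # callable or None, standing in for a Formatter
        self.in_worker = in_worker  # stands in for the DataLoader worker check
        self.prepared = False

    def _iter_worker(self):
        # Placeholder for the per-worker path (real sharding is omitted here).
        yield from self.rows

    def _prepare(self):
        # Placeholder for finalizing distributed and epoch settings.
        self.prepared = True
        return self.rows

    def __iter__(self):
        # Step 1: inside a worker, delegate and return early.
        if self.in_worker:
            yield from self._iter_worker()
            return
        # Step 2: prepare the underlying iterable.
        rows = self._prepare()
        # Step 3: formatted path, each row goes through the formatter.
        if self.formatter is not None:
            for row in rows:
                yield self.formatter(row)
            return
        # Step 4: standard path, iterate (key, example) pairs and yield the example.
        for _key, example in enumerate(rows):
            yield example


plain = list(SketchDataset([{"x": 1}, {"x": 2}]))
formatted = list(SketchDataset([{"x": 1}], formatter=lambda row: tuple(row.values())))
```

In this sketch, plain holds the untouched dictionaries and formatted holds the tuples produced by the formatter, which mirrors how the real method yields either raw example dictionaries or formatter output depending on the configured format.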

Usage

Use IterableDataset.__iter__ implicitly via a for loop or next(iter(ds)) to consume elements from any streaming dataset.
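
Streaming datasets can be very large, so it often helps to bound how many elements are consumed. The sketch below uses itertools.islice from the standard library on a hypothetical generator, fake_stream, that stands in for a streaming dataset. The same pattern should apply to any iterable, under the assumption that it produces examples lazily.

```python
from itertools import islice


def fake_stream():
    """Hypothetical endless example source standing in for a streaming dataset."""
    i = 0
    while True:
        yield {"text": f"example {i}"}
        i += 1


# Take only the first three examples; the rest of the stream is never produced.
head = list(islice(fake_stream(), 3))
```

The library also provides IterableDataset.take(n), which returns a new streaming dataset limited to the first n examples, as an alternative to slicing the iterator by hand.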

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/iterable_dataset.py
  • Lines: L2545-L2568

Signature

def __iter__(self):

Import

from datasets import load_dataset

# "my_dataset" is a placeholder dataset name; process() below stands for user code.
ds = load_dataset("my_dataset", split="train", streaming=True)
# __iter__ is called implicitly by for loops
for example in ds:
    process(example)

I/O Contract

Inputs

Name | Type | Required | Description
self | IterableDataset | Yes | The streaming dataset instance with its configured pipeline of lazy transforms and optional formatting.

Outputs

Name | Type | Description
example | dict (or formatted type) | Each yielded element is a dictionary mapping column names to values. If with_format was called, values are converted to the specified type (e.g., torch.Tensor, np.ndarray).

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)

# Iterate with a for loop
for example in ds:
    print(example["text"])
    break  # just print the first one

# Or use next() for a single element
it = iter(ds)
first = next(it)
print(first)
# {'label': 1, 'text': 'the rock is destined to be the 21st century\'s new "conan" ...'}
