Workflow:Huggingface Datasets Dataset Streaming
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Machine_Learning, Large_Scale_Data |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
End-to-end process for loading and processing datasets in streaming mode, enabling iteration over arbitrarily large datasets without downloading the full data to disk.
Description
This workflow covers streaming-mode dataset usage, where data is fetched and processed lazily as an IterableDataset rather than materialized into Arrow files on disk. Streaming mode monkey-patches standard Python file I/O operations with fsspec-backed, remote-capable equivalents, allowing dataset builder scripts to read directly from remote storage (HTTP, S3, GCS, etc.) without modification. The result is a lazy generator chain that applies transformations (map, filter, shuffle, skip, take) on the fly during iteration. This suits datasets that exceed available disk space, and cases where iteration must begin before a full download would complete.
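The lazy generator chain described above can be illustrated with plain Python generators. This is a minimal sketch, not the library's implementation: remote_source, lazy_map, and lazy_filter are hypothetical stand-ins, and the fetched list exists only to show which source items were actually read.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

fetched = []  # records which source items were actually read


def remote_source(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for a remote reader: yields one example at a time."""
    for i in range(n):
        fetched.append(i)
        yield {"id": i, "text": f"example {i}"}


def lazy_map(fn: Callable[[dict], dict], stream: Iterable[dict]) -> Iterator[dict]:
    """Apply fn to each example only when the consumer asks for it."""
    for example in stream:
        yield fn(example)


def lazy_filter(pred: Callable[[dict], bool], stream: Iterable[dict]) -> Iterator[dict]:
    """Drop examples that fail pred, again only on demand."""
    for example in stream:
        if pred(example):
            yield example


# Building the chain reads nothing from the source.
mapped = lazy_map(lambda ex: {**ex, "length": len(ex["text"])}, remote_source(1_000_000))
kept = lazy_filter(lambda ex: ex["id"] % 2 == 0, mapped)
pipeline = islice(kept, 3)  # plays the role of take(3)
read_before_iteration = len(fetched)

# Only iteration pulls data through the chain, and only as much as needed.
first_three = list(pipeline)
```

Although the source could yield a million examples, only the five needed to produce three even-numbered ones are read.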
Usage
Execute this workflow when working with very large datasets (terabytes of data), when disk space is limited, when you want to start iterating immediately without waiting for the full download, or when you need to sample a small portion of a large dataset for prototyping. Streaming is particularly useful for web-scale datasets like Common Crawl, ImageNet-scale image datasets, or multi-language corpora.
Execution Steps
Step 1: Enable Streaming Mode
Activate streaming by passing streaming=True to load_dataset. The call then returns an IterableDataset (or an IterableDatasetDict when no split is specified) instead of a map-style Dataset. No data files are downloaded at this point; the system only resolves the dataset module and prepares the lazy generator pipeline.
Key considerations:
- Streaming mode bypasses the download-and-prepare step entirely
- The StreamingDownloadManager returns URLs instead of local file paths
- Standard file I/O functions are monkey-patched with remote-capable equivalents via fsspec
- Data files are read on-the-fly as the iterator advances
Step 2: Apply Lazy Transformations
Chain transformations onto the IterableDataset using map, filter, and other lazy operations. These transformations are not executed immediately but are recorded as a pipeline of operations that will be applied during iteration. Each transformation returns a new IterableDataset wrapping the previous one with the additional operation.
Key considerations:
- map() applies a function to each example lazily during iteration
- filter() removes examples that do not match a condition
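- shuffle() is approximate: it shuffles the shard order, then fills a buffer of configurable size (buffer_size) and yields randomly sampled examples from it, so larger buffers mix better at the cost of more memory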
- take(n) limits iteration to the first n examples (useful for prototyping)
- skip(n) skips the first n examples
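The buffer-based shuffle mentioned above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the library's code: buffer_shuffle is a hypothetical helper, and it leaves out the shard-order shuffling that the real shuffle() also performs.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffer_shuffle(stream: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximate shuffle of a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is yielded until the buffer is full.
            buffer.append(item)
            continue
        # Steady state: yield a random buffered item, replace it with the new one.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: flush what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(20), buffer_size=5, seed=42))
repeat = list(buffer_shuffle(range(20), buffer_size=5, seed=42))
```

Because an example can only be yielded once it has entered the buffer, output position k can never hold a source item later than k + buffer_size - 1. A small buffer on sorted data therefore produces only local mixing.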
Step 3: Configure Output Format
Set the output format for the streamed data to match your ML framework. The with_format method configures automatic conversion of yielded examples to PyTorch tensors, TensorFlow tensors, NumPy arrays, or other supported formats during iteration.
Key considerations:
- Supported formats include torch, tensorflow, numpy, jax, pandas, polars, and arrow
- Format conversion happens on-the-fly during iteration, not in advance
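- On an IterableDataset, with_format takes only the format type and applies to every yielded column; with tensor formats such as torch or numpy, values that have no tensor equivalent (for example strings) remain Python objects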
Step 4: Iterate Over the Stream
Consume the dataset by iterating over the IterableDataset in a training loop or data pipeline. Each iteration step triggers the lazy evaluation chain: fetching data from the remote source, applying all chained transformations, converting to the target format, and yielding the result.
Key considerations:
- Iteration triggers actual HTTP requests and data processing
- Data is processed one example (or batch) at a time, so memory use is bounded by the batch size and any shuffle buffer rather than by the dataset size
- Starting a new loop over the dataset restarts the stream from the beginning
- For distributed training, use split_dataset_by_node to partition shards across workers
Step 5: Distribute Across Workers
For multi-node or multi-worker training setups, partition the stream across workers so each worker processes a distinct subset of the data. The distribution system assigns whole shards to workers when the number of shards is divisible by the number of workers, and otherwise falls back to round-robin assignment of individual examples.
Key considerations:
- When num_shards is divisible by world_size, each worker gets a disjoint set of shards
- Otherwise, each worker receives every world_size-th example
- When shuffle() is used, every worker must pass the same fixed seed so that all workers agree on the shard and example order before it is partitioned
- Compatible with PyTorch DataLoader with multiple workers
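The two partitioning strategies can be modeled with a short pure-Python sketch. It is a simplified model of the behavior described above, not the library's code: assign_to_rank is a hypothetical helper, and the strided choice of shards per rank is an assumption made for illustration. In practice the partitioning is requested with split_dataset_by_node(dataset, rank=rank, world_size=world_size) from datasets.distributed.

```python
from typing import Dict, List


def assign_to_rank(shards: Dict[str, List[int]], rank: int, world_size: int) -> List[int]:
    """Return the examples one rank would see under the two strategies."""
    shard_names = sorted(shards)
    if len(shard_names) % world_size == 0:
        # Shard-level split: each rank reads a disjoint subset of shards.
        own = shard_names[rank::world_size]
        return [ex for name in own for ex in shards[name]]
    # Fallback: every rank scans all shards and keeps 1 example in world_size.
    all_examples = [ex for name in shard_names for ex in shards[name]]
    return all_examples[rank::world_size]


# Four illustrative shards holding examples 0..11.
shards = {f"shard-{i}": list(range(3 * i, 3 * i + 3)) for i in range(4)}

two_ranks = [assign_to_rank(shards, r, world_size=2) for r in range(2)]
three_ranks = [assign_to_rank(shards, r, world_size=3) for r in range(3)]
```

With two ranks, each one reads only its own two shards. With three ranks, the shard count is not divisible, so every rank reads all four shards and discards two thirds of the examples, which is why a shard count that is a multiple of the world size is preferable.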