Implementation:Huggingface Datasets IterableDataset Shuffle
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the HuggingFace Datasets library, for buffer-based shuffling of the elements of a streaming dataset.
Description
IterableDataset.shuffle creates a new IterableDataset backed by a BufferShuffledExamplesIterable. The operation performs two levels of randomization:
- Shard-level shuffle: The underlying data sources (shards) are permuted using self._ex_iterable.shuffle_data_sources(generator). This reorders the files or partitions that feed the stream.
- Element-level buffer shuffle: A buffer of buffer_size elements is maintained. Once the buffer is full, each iteration step samples a random element from the buffer, yields it, and fills its slot with the next element from the stream.
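The element-level step can be sketched in plain Python. This is a minimal illustration of the technique, not the library's BufferShuffledExamplesIterable: the function name buffer_shuffle is hypothetical, the standard-library random.Random stands in for the NumPy generator, and draining the leftover buffer in random order at the end of the stream is an assumption of the sketch.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffer_shuffle(
    stream: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Yield the items of `stream` in a locally shuffled order (illustrative sketch)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)  # stdlib stand-in for the NumPy generator
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is yielded until the buffer is full.
            buffer.append(item)
            continue
        # Steady state: yield a random buffered item and reuse its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Stream exhausted: drain whatever is left, in random order (assumption).
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(20), buffer_size=5, seed=42))
```

In this sketch the element at output position k is always drawn from the first k + buffer_size inputs, so a small buffer only shuffles locally, and buffer_size=1 leaves the order unchanged. That locality is why the shard-level permutation matters for sources whose shards are internally ordered.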
If the dataset uses Arrow-backed iterables (iter_arrow is available), the underlying iterable is first wrapped in RebatchedArrowExamplesIterable with batch_size=1, so that the shuffle buffer operates on single examples rather than on whole Arrow batches.
The random generator is either created from the provided seed via np.random.default_rng(seed) or deep-copied from a user-provided numpy.random.Generator.
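The seed/generator rule above can be illustrated with a small helper. The name resolve_generator is hypothetical and not part of the library. The sketch only shows why a deep copy is useful: drawing from the copy does not advance the caller's generator.

```python
from copy import deepcopy
from typing import Optional

import numpy as np


def resolve_generator(
    seed: Optional[int] = None,
    generator: Optional[np.random.Generator] = None,
) -> np.random.Generator:
    """Hypothetical helper mirroring the seed/generator rule described above."""
    if generator is None:
        # No generator supplied: build one from the seed (None means OS entropy).
        return np.random.default_rng(seed)
    # Copy so that drawing numbers does not advance the caller's generator.
    return deepcopy(generator)


# Two generators built from the same seed produce the same draws.
a = resolve_generator(seed=42)
b = resolve_generator(seed=42)
same_seed_matches = (
    a.integers(0, 1000, size=5).tolist() == b.integers(0, 1000, size=5).tolist()
)

# Drawing from the copy leaves the user's generator in its original state.
user_rng = np.random.default_rng(7)
copied = resolve_generator(generator=user_rng)
first_from_copy = copied.integers(0, 1000, size=5).tolist()
first_from_original = user_rng.integers(0, 1000, size=5).tolist()
```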
Usage
Use IterableDataset.shuffle when you need to randomize the order of elements in a streaming dataset for training. Set a fixed seed for reproducibility, especially in distributed training setups.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/iterable_dataset.py
- Lines: L3015-L3082
Signature
def shuffle(
self, seed=None, generator: Optional[np.random.Generator] = None, buffer_size: int = 1000
) -> "IterableDataset":
Import
from datasets import load_dataset
ds = load_dataset("my_dataset", split="train", streaming=True)
# shuffle is a method on the returned IterableDataset
ds = ds.shuffle(seed=42, buffer_size=10_000)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| seed | int | No | Random seed for reproducibility. Used for both buffer sampling and shard shuffling. Defaults to None. |
| generator | Optional[np.random.Generator] | No | NumPy random generator. If None, created from seed via np.random.default_rng(seed). |
| buffer_size | int | No | Size of the shuffle buffer. Larger values bring the output closer to a uniform shuffle but keep more elements in memory. Defaults to 1000. |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | IterableDataset | A new streaming dataset that yields elements in a shuffled order. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)
# Original order
list(ds.take(3))
# [{'label': 1, 'text': 'the rock is destined ...'}, ...]
# Shuffled order with seed for reproducibility
shuffled_ds = ds.shuffle(seed=42)
list(shuffled_ds.take(3))
# [{'label': 1, 'text': "a sports movie with action that's exciting ..."}, ...]