
Implementation:Huggingface Datasets IterableDataset Shuffle

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete tool in the HuggingFace Datasets library for buffer-based shuffling of elements in a streaming dataset.

Description

IterableDataset.shuffle creates a new IterableDataset backed by a BufferShuffledExamplesIterable. The operation performs two levels of randomization:

  1. Shard-level shuffle: The underlying data sources (shards) are permuted using self._ex_iterable.shuffle_data_sources(generator). This reorders the files or partitions that feed the stream.
  2. Element-level buffer shuffle: A buffer of buffer_size elements is maintained. At each iteration step, a random element is sampled from the buffer and yielded, and its slot is filled by the next element from the stream.
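The element-level buffer mechanism described above can be sketched in plain Python. This is a minimal illustration of the technique, not the library's actual BufferShuffledExamplesIterable:

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def buffer_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield elements of `stream` in buffer-shuffled order."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)            # fill phase: no output yet
        else:
            i = rng.randrange(buffer_size)
            yield buffer[i]                # emit a random buffered element
            buffer[i] = item               # refill its slot from the stream
    rng.shuffle(buffer)                    # drain the remaining buffer
    yield from buffer

out = list(buffer_shuffle(range(10), buffer_size=4, seed=42))
assert sorted(out) == list(range(10))      # every element appears exactly once
```

One consequence of this design: an element can be emitted at most about buffer_size positions earlier than it arrived (though arbitrarily later), which is why larger buffers give stronger randomization.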

If the dataset uses Arrow-backed iterables (iter_arrow is available), the shuffled iterable is additionally wrapped in RebatchedArrowExamplesIterable with batch_size=1 to ensure element-level granularity.

The random generator is either created from the provided seed via np.random.default_rng(seed) or deep-copied from a user-provided numpy.random.Generator.
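A sketch of that seed/generator resolution follows; resolve_generator is a hypothetical helper name used here for illustration, not a function in the library:

```python
import copy
import numpy as np

def resolve_generator(seed=None, generator=None):
    # Deep-copy a caller-supplied generator so the caller's RNG state is
    # not consumed; otherwise seed a fresh one (a None seed draws fresh
    # entropy from the OS).
    if generator is not None:
        return copy.deepcopy(generator)
    return np.random.default_rng(seed)

# Same seed, independent generators, identical draws
a = resolve_generator(seed=42)
b = resolve_generator(seed=42)
assert a.integers(0, 1_000_000) == b.integers(0, 1_000_000)

# A user generator passed in is left untouched: only the copy advances
g = np.random.default_rng(0)
resolve_generator(generator=g).integers(0, 100)
assert g.integers(0, 100) == np.random.default_rng(0).integers(0, 100)
```

The deep copy is what makes repeated shuffles with the same user-supplied generator reproducible: each call starts from the generator's original state.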

Usage

Use IterableDataset.shuffle when you need to randomize the order of elements in a streaming dataset for training. Set a fixed seed for reproducibility, especially in distributed training setups.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/iterable_dataset.py
  • Lines: L3015-L3082

Signature

def shuffle(
    self, seed=None, generator: Optional[np.random.Generator] = None, buffer_size: int = 1000
) -> "IterableDataset":

Import

from datasets import load_dataset

ds = load_dataset("my_dataset", split="train", streaming=True)
# shuffle is a method on the returned IterableDataset
ds = ds.shuffle(seed=42, buffer_size=10_000)

I/O Contract

Inputs

  • seed (int, optional; default None): Random seed for reproducibility. Used for both buffer sampling and shard shuffling.
  • generator (Optional[np.random.Generator], optional): NumPy random generator. If None, one is created from seed via np.random.default_rng(seed).
  • buffer_size (int, optional; default 1000): Size of the shuffle buffer. Larger values produce better randomization.
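To build intuition for the buffer_size trade-off, a small plain-Python simulation of the buffering technique (illustrative only, not the library implementation) shows the degenerate case: a buffer of one element leaves the stream order unchanged, while a larger buffer yields a genuine permutation:

```python
import random

def simulate_buffer_shuffle(items, buffer_size, seed=0):
    # Illustrative simulation of buffer-based shuffling.
    rng = random.Random(seed)
    buf = []
    for x in items:
        if len(buf) < buffer_size:
            buf.append(x)          # fill the buffer first
        else:
            i = rng.randrange(buffer_size)
            yield buf[i]           # emit a random slot
            buf[i] = x             # refill it from the stream
    rng.shuffle(buf)               # drain the remainder
    yield from buf

items = list(range(100))
assert list(simulate_buffer_shuffle(items, 1)) == items   # buffer_size=1: order unchanged
shuffled = list(simulate_buffer_shuffle(items, 50))
assert shuffled != items and sorted(shuffled) == items    # reordered, still a permutation
```

In practice, buffer_size trades memory for shuffle quality: the buffer holds buffer_size fully loaded examples in memory, so very large values can be costly when individual examples are large.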

Outputs

  • dataset (IterableDataset): A new streaming dataset that yields elements in shuffled order.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)

# Original order
list(ds.take(3))
# [{'label': 1, 'text': 'the rock is destined ...'}, ...]

# Shuffled order with seed for reproducibility
shuffled_ds = ds.shuffle(seed=42)
list(shuffled_ds.take(3))
# [{'label': 1, 'text': "a sports movie with action that's exciting ..."}, ...]

