
Implementation:Huggingface Datasets IterableDataset Shuffle

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete tool in the HuggingFace Datasets library for buffer-based shuffling of elements in a streaming dataset.

Description

IterableDataset.shuffle creates a new IterableDataset backed by a BufferShuffledExamplesIterable. The operation performs two levels of randomization:

  1. Shard-level shuffle: The underlying data sources (shards) are permuted using self._ex_iterable.shuffle_data_sources(generator). This reorders the files or partitions that feed the stream.
  2. Element-level buffer shuffle: A buffer of buffer_size elements is maintained. At each iteration step, a random element is sampled from the buffer and yielded, and its slot is filled by the next element from the stream.
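The element-level buffer mechanism described above can be sketched in plain Python. This is a minimal illustration of the technique, not the library's actual BufferShuffledExamplesIterable:

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def buffer_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield elements of `stream` in buffer-shuffled order."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)            # fill phase: no output yet
        else:
            i = rng.randrange(buffer_size)
            yield buffer[i]                # emit a random buffered element
            buffer[i] = item               # refill its slot from the stream
    rng.shuffle(buffer)                    # drain the remaining buffer
    yield from buffer

out = list(buffer_shuffle(range(10), buffer_size=4, seed=42))
assert sorted(out) == list(range(10))      # every element appears exactly once
```

One consequence of this design: an element can be emitted at most about buffer_size positions earlier than it arrived (though arbitrarily later), which is why larger buffers give stronger randomization.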

If the dataset uses Arrow-backed iterables (iter_arrow is available), the shuffled iterable is additionally wrapped in RebatchedArrowExamplesIterable with batch_size=1 to ensure element-level granularity.

The random generator is either created from the provided seed via np.random.default_rng(seed) or deep-copied from a user-provided numpy.random.Generator.
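A sketch of that seed/generator resolution follows; resolve_generator is a hypothetical helper name used here for illustration, not a function in the library:

```python
import copy
import numpy as np

def resolve_generator(seed=None, generator=None):
    # Deep-copy a caller-supplied generator so the caller's RNG state is
    # not consumed; otherwise seed a fresh one (a None seed draws fresh
    # entropy from the OS).
    if generator is not None:
        return copy.deepcopy(generator)
    return np.random.default_rng(seed)

# Same seed, independent generators, identical draws
a = resolve_generator(seed=42)
b = resolve_generator(seed=42)
assert a.integers(0, 1_000_000) == b.integers(0, 1_000_000)

# A user generator passed in is left untouched: only the copy advances
g = np.random.default_rng(0)
resolve_generator(generator=g).integers(0, 100)
assert g.integers(0, 100) == np.random.default_rng(0).integers(0, 100)
```

The deep copy is what makes repeated shuffles with the same user-supplied generator reproducible: each call starts from the generator's original state.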

Usage

Use IterableDataset.shuffle when you need to randomize the order of elements in a streaming dataset for training. Set a fixed seed for reproducibility, especially in distributed training setups.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/iterable_dataset.py
  • Lines: L3015-L3082

Signature

def shuffle(
    self, seed=None, generator: Optional[np.random.Generator] = None, buffer_size: int = 1000
) -> "IterableDataset":

Import

from datasets import load_dataset

ds = load_dataset("my_dataset", split="train", streaming=True)
# shuffle is a method on the returned IterableDataset
ds = ds.shuffle(seed=42, buffer_size=10_000)

I/O Contract

Inputs

  • seed (int, optional; default None): Random seed for reproducibility. Used for both buffer sampling and shard shuffling.
  • generator (Optional[np.random.Generator], optional): NumPy random generator. If None, one is created from seed via np.random.default_rng(seed).
  • buffer_size (int, optional; default 1000): Size of the shuffle buffer. Larger values produce better randomization.
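To build intuition for the buffer_size trade-off, a small plain-Python simulation of the buffering technique (illustrative only, not the library implementation) shows the degenerate case: a buffer of one element leaves the stream order unchanged, while a larger buffer yields a genuine permutation:

```python
import random

def simulate_buffer_shuffle(items, buffer_size, seed=0):
    # Illustrative simulation of buffer-based shuffling.
    rng = random.Random(seed)
    buf = []
    for x in items:
        if len(buf) < buffer_size:
            buf.append(x)          # fill the buffer first
        else:
            i = rng.randrange(buffer_size)
            yield buf[i]           # emit a random slot
            buf[i] = x             # refill it from the stream
    rng.shuffle(buf)               # drain the remainder
    yield from buf

items = list(range(100))
assert list(simulate_buffer_shuffle(items, 1)) == items   # buffer_size=1: order unchanged
shuffled = list(simulate_buffer_shuffle(items, 50))
assert shuffled != items and sorted(shuffled) == items    # reordered, still a permutation
```

In practice, buffer_size trades memory for shuffle quality: the buffer holds buffer_size fully loaded examples in memory, so very large values can be costly when individual examples are large.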

Outputs

  • dataset (IterableDataset): A new streaming dataset that yields elements in shuffled order.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)

# Original order
list(ds.take(3))
# [{'label': 1, 'text': 'the rock is destined ...'}, ...]

# Shuffled order with seed for reproducibility
shuffled_ds = ds.shuffle(seed=42)
list(shuffled_ds.take(3))
# [{'label': 1, 'text': "a sports movie with action that's exciting ..."}, ...]

