Implementation:Huggingface Datasets Dataset Shuffle

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, ML_Preprocessing
Last Updated: 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the HuggingFace Datasets library, for randomly reordering the rows of a dataset.

Description

The shuffle method returns a new dataset whose rows appear in random order. It computes a random permutation of the row indices with NumPy's random number generator and stores that permutation as an indices mapping over the original data. Creating the mapping is fast, but reads are no longer contiguous, so reading the shuffled dataset can be up to 10x slower. To restore read speed after shuffling, call flatten_indices(), which physically rewrites the data in the shuffled order. For reproducibility, pass either seed or a custom np.random.Generator.

Usage

Use Dataset.shuffle when you need to randomize the order of examples before training, to break inherent data ordering that could bias gradient updates, or when you need reproducible random orderings across runs.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L4503-L4633

Signature

@transmit_format
@fingerprint_transform(
    inplace=False, randomized_function=True, ignore_kwargs=["load_from_cache_file", "indices_cache_file_name"]
)
def shuffle(
    self,
    seed: Optional[int] = None,
    generator: Optional[np.random.Generator] = None,
    keep_in_memory: bool = False,
    load_from_cache_file: Optional[bool] = None,
    indices_cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    new_fingerprint: Optional[str] = None,
) -> "Dataset":

Import

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
ds = ds.shuffle(seed=42)

I/O Contract

Inputs

Name | Type | Required | Description
seed | Optional[int] | No | Seed for the random number generator. If None, entropy is pulled from the OS.
generator | Optional[np.random.Generator] | No | NumPy random Generator to use. Cannot be provided together with seed.
keep_in_memory | bool | No | Keep the shuffled indices in memory instead of writing them to a cache file. Defaults to False.
load_from_cache_file | Optional[bool] | No | Reuse cached shuffled indices, if available, instead of recomputing them.
indices_cache_file_name | Optional[str] | No | Path of the cache file in which to store the shuffled indices.
writer_batch_size | Optional[int] | No | Number of rows per write operation for the cache file writer. Defaults to 1000.
new_fingerprint | Optional[str] | No | Fingerprint to assign to the dataset after the transform.

Outputs

Name | Type | Description
return | Dataset | A new dataset with rows in a randomly shuffled order.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
print(ds["label"][:10])
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Shuffle with a seed for reproducibility
shuffled_ds = ds.shuffle(seed=42)
print(shuffled_ds["label"][:10])
# [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
