Implementation:Huggingface Datasets Dataset Shuffle
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ML_Preprocessing |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for randomly reordering the rows of a dataset.
Description
The shuffle method returns a new dataset whose rows appear in random order. It computes a random permutation of the row indices with a NumPy random number generator and stores that permutation as an indices mapping over the original data. Creating the mapping is fast, but reads that go through it are no longer contiguous and can be up to 10x slower. To restore read speed, call flatten_indices() after shuffling, which physically rewrites the data in the shuffled order. For reproducibility, pass either seed or a custom np.random.Generator.
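The indices-mapping idea described above can be sketched with NumPy alone. This is a conceptual illustration, not the library's code: the helper name shuffle_indices and the toy rows array are made up for the example.

```python
import numpy as np


def shuffle_indices(num_rows, seed=None):
    """Return a random permutation of row indices (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return rng.permutation(num_rows)


# Toy stand-in for the original, unshuffled storage.
rows = np.array(["a", "b", "c", "d", "e"])

# The "indices mapping": cheap to build, the data itself is not touched.
indices = shuffle_indices(len(rows), seed=42)

# Reading through the mapping means one indirect lookup per row.
shuffled_view = [rows[i] for i in indices]

# "Flattening": materialize the rows in shuffled order so later reads
# are sequential again.
flattened = rows[indices]
```

The same seed always yields the same permutation, which is what makes a seeded shuffle repeatable.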
Usage
Use Dataset.shuffle when you need to randomize the order of examples before training, to break inherent data ordering that could bias gradient updates, or when you need reproducible random orderings across runs.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L4503-L4633
Signature
@transmit_format
@fingerprint_transform(
    inplace=False, randomized_function=True, ignore_kwargs=["load_from_cache_file", "indices_cache_file_name"]
)
def shuffle(
    self,
    seed: Optional[int] = None,
    generator: Optional[np.random.Generator] = None,
    keep_in_memory: bool = False,
    load_from_cache_file: Optional[bool] = None,
    indices_cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    new_fingerprint: Optional[str] = None,
) -> "Dataset":
Import
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
ds = ds.shuffle(seed=42)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| seed | Optional[int] | No | Seed for the random number generator. If None, entropy is pulled from the OS. |
| generator | Optional[np.random.Generator] | No | NumPy random Generator to use. Cannot be provided together with seed. |
| keep_in_memory | bool | No | Keep shuffled indices in memory. Defaults to False. |
| load_from_cache_file | Optional[bool] | No | Use cached shuffled indices if available. |
| indices_cache_file_name | Optional[str] | No | Cache file path for shuffled indices. |
| writer_batch_size | Optional[int] | No | Rows per write operation. Defaults to 1000. |
| new_fingerprint | Optional[str] | No | The new fingerprint after transform. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Dataset | A new dataset with rows in a randomly shuffled order. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
print(ds["label"][:10])
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# Shuffle with a seed for reproducibility
shuffled_ds = ds.shuffle(seed=42)
print(shuffled_ds["label"][:10])
# [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]