Principle: Hugging Face Datasets Streaming Shuffle
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Buffer-based shuffling randomizes the order of elements in a streaming dataset without requiring the full dataset to be materialized in memory.
Description
Shuffling a streaming dataset presents a fundamental challenge: the entire dataset is never available at once, so a global random permutation is impossible. Buffer-based shuffling addresses this by maintaining a fixed-size buffer of elements and randomly sampling from it. As each element is sampled and yielded, it is replaced by the next element from the stream, maintaining the buffer at its configured size.
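The mechanism described above can be sketched as a plain-Python generator. This is a minimal illustration of the technique, not the library's actual implementation; the function name `buffer_shuffle` is hypothetical.

```python
import random

def buffer_shuffle(stream, buffer_size, seed=None):
    """Yield elements of `stream` in approximately random order using a
    fixed-size buffer (illustrative sketch, not the library's code)."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)        # fill phase: nothing yielded yet
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]          # emit a random buffered element...
            buffer[idx] = item         # ...and refill the vacancy from the stream
    rng.shuffle(buffer)                # drain whatever remains at end of stream
    yield from buffer

out = list(buffer_shuffle(range(10), buffer_size=4, seed=0))
```

Note that the output is a permutation of the input: every element is eventually yielded exactly once, and a fixed seed makes the permutation reproducible.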
The quality of the shuffle depends on the buffer size:
- Buffer size equal to dataset size: Equivalent to a perfect global shuffle. Every element has an equal probability of appearing at any position.
- Buffer size of 1: No shuffling at all; elements are yielded in their original order.
- Buffer size between 1 and dataset size: Provides approximate shuffling with a locality bias. An element can be emitted at most `buffer_size - 1` positions earlier than its original position, although it can be delayed much longer if it happens to linger in the buffer.
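The locality bound can be checked empirically with a small simulation. The sketch below (a hypothetical `buffer_shuffle`, not the library's code) shuffles a sequence of consecutive integers, so each value doubles as its original position; no element ever appears more than `buffer_size - 1` positions before where it started.

```python
import random

def buffer_shuffle(stream, buffer_size, seed):
    """Minimal buffer-shuffle sketch for the experiment below."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            i = rng.randrange(buffer_size)
            yield buf[i]
            buf[i] = item
    rng.shuffle(buf)
    yield from buf

n, k = 100, 8
for seed in range(20):
    out = list(buffer_shuffle(range(n), buffer_size=k, seed=seed))
    # Forward movement is bounded: an element cannot be emitted more than
    # k - 1 positions early. Delays, by contrast, are unbounded.
    assert all(pos - orig >= -(k - 1) for pos, orig in enumerate(out))
```

The asymmetry is worth noting: earliness is capped by the buffer size, but an unlucky element can sit in the buffer for a long time, which is why small buffers retain a noticeable locality bias.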
In addition to element-level buffer shuffling, the operation also shuffles the order of data shards (the underlying files or partitions). This shard-level shuffling provides coarse-grained randomization that complements the fine-grained buffer shuffle. However, if the shard order has been fixed by a prior skip or take operation, it is preserved.
Reproducibility is achieved through a seed parameter that controls both the buffer sampling and the shard ordering. This is particularly important in distributed training where all nodes must agree on the shard assignment.
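Seeded shard ordering can be sketched as follows. Because every node derives the permutation from the same seed, all nodes agree on which shard goes where without any communication. The helper name `shard_order` and the `epoch` offset are assumptions for illustration, not the library's API.

```python
import random

def shard_order(num_shards, seed, epoch=0):
    """Deterministically permute shard indices from a shared seed
    (hypothetical helper; every node computes the same order)."""
    rng = random.Random(seed + epoch)   # vary the permutation per epoch
    order = list(range(num_shards))
    rng.shuffle(order)
    return order

# Two "nodes" with the same seed agree on the shard assignment.
node_a = shard_order(8, seed=42)
node_b = shard_order(8, seed=42)
```

Deriving the permutation from `seed + epoch` is one simple way to reshuffle shards between epochs while keeping every node in lockstep.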
Usage
Use streaming shuffle when:
- You are training a model on a streaming dataset and need to avoid ordering bias.
- You want approximate randomization without loading the full dataset into memory.
- You need reproducible shuffling across multiple runs or distributed nodes (by fixing the seed).
- You are combining shuffle with other lazy operations like map, filter, take, and skip.
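Because all of these operations are lazy, they compose as chained generators. The plain-Python analogue below sketches a shuffle → map → filter → take pipeline under that assumption; `buffer_shuffle` is a hypothetical stand-in for the library's buffer shuffle, and nothing is materialized until elements are actually pulled.

```python
import itertools
import random

def buffer_shuffle(stream, buffer_size, seed):
    """Minimal buffer-shuffle sketch used as the head of the pipeline."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            i = rng.randrange(buffer_size)
            yield buf[i]
            buf[i] = item
    rng.shuffle(buf)
    yield from buf

# Lazy pipeline: shuffle -> map -> filter -> take. Each stage is a
# generator, so only the elements needed for the final take are computed.
stream = range(1000)
shuffled = buffer_shuffle(stream, buffer_size=16, seed=7)
mapped = (x * 2 for x in shuffled)                 # map
filtered = (x for x in mapped if x % 3 == 0)       # filter
first_five = list(itertools.islice(filtered, 5))   # take(5)
```

Only enough of the source stream is consumed to produce five surviving elements, which mirrors why these operations stay cheap on arbitrarily large streaming datasets.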
Theoretical Basis
Buffer-based shuffling is an instance of reservoir sampling adapted for streaming contexts. The classic reservoir sampling algorithm (Vitter, 1985) maintains a fixed-size sample from a stream of unknown length, ensuring each element has an equal probability of being in the sample. The buffer shuffle variant extends this: instead of sampling for inclusion, it samples for the order of emission.
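For contrast, the classic reservoir scheme (Algorithm R) can be sketched as below: it selects *which* k elements survive, whereas the buffer shuffle emits *every* element and only randomizes *when*.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Algorithm R: keep a uniform random sample of k items from a
    stream of unknown length using O(k) memory (illustrative sketch)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(i + 1)    # item kept with probability k/(i+1)
            if j < k:
                reservoir[j] = item     # evict a uniformly chosen survivor
    return reservoir

sample = reservoir_sample(range(1000), k=10, seed=0)
```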
The algorithm is a variation of the Fisher-Yates shuffle applied incrementally. At each step, a random index in the buffer is selected, the element at that index is yielded, and the vacancy is filled by the next stream element. This ensures O(1) memory per buffered element and O(1) time per yielded element.
The two-level approach (shard-level permutation plus element-level buffer) is analogous to block randomization in experimental design: shards provide coarse randomization units, while the buffer provides fine-grained mixing within and across shard boundaries.
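The two levels can be combined in one sketch: first permute the shard order with a seeded RNG, then run the concatenated stream through a buffer shuffle. The function `two_level_shuffle` is a hypothetical illustration of the scheme, not the library's implementation.

```python
import itertools
import random

def two_level_shuffle(shards, buffer_size, seed):
    """Coarse shard-level permutation followed by fine element-level
    buffer shuffling (illustrative sketch of the two-level scheme)."""
    rng = random.Random(seed)
    order = list(range(len(shards)))
    rng.shuffle(order)                  # level 1: permute shard order
    stream = itertools.chain.from_iterable(shards[i] for i in order)
    buf = []
    for item in stream:                 # level 2: buffer shuffle the stream
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            j = rng.randrange(buffer_size)
            yield buf[j]
            buf[j] = item
    rng.shuffle(buf)                    # drain the remaining buffer
    yield from buf

shards = [list(range(i * 10, (i + 1) * 10)) for i in range(4)]
out = list(two_level_shuffle(shards, buffer_size=8, seed=3))
```

Note that the buffer mixes elements across shard boundaries: whenever the stream crosses from one shard into the next, the buffer still holds elements of the previous shard, so the seam between shards is blurred rather than sharp.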