

Principle:Huggingface Datasets Streaming Shuffle

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

Buffer-based shuffling randomizes the element order of a streaming dataset without requiring the full dataset to be materialized in memory.

Description

Shuffling a streaming dataset presents a fundamental challenge: the entire dataset is never available at once, so a global random permutation is impossible. Buffer-based shuffling addresses this by maintaining a fixed-size buffer of elements and randomly sampling from it. As each element is sampled and yielded, it is replaced by the next element from the stream, maintaining the buffer at its configured size.
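The mechanism can be sketched in a few lines of pure Python. The `buffer_shuffle` generator below is a hypothetical helper illustrating the technique, not the datasets library API:

```python
import random

def buffer_shuffle(stream, buffer_size, seed=None):
    """Approximately shuffle an iterable with a fixed-size buffer.

    Sketch only: `buffer_shuffle` is a hypothetical name, not the
    library's actual entry point.
    """
    rng = random.Random(seed)
    iterator = iter(stream)
    buffer = []
    # Fill the buffer up to its configured size.
    for item in iterator:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Yield a randomly chosen buffer element, then refill the vacancy
    # with the next stream element so the buffer stays full.
    for item in iterator:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: drain the remaining buffer in random order.
    rng.shuffle(buffer)
    yield from buffer
```

Because the generator yields exactly one element per stream element, the output is a permutation of the input, produced lazily.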

The quality of the shuffle depends on the buffer size:

  • Buffer size equal to dataset size: Equivalent to a perfect global shuffle. Every element has an equal probability of appearing at any position.
  • Buffer size of 1: No shuffling at all; elements are yielded in their original order.
  • Buffer size between 1 and dataset size: Provides approximate shuffling with a locality bias. An element can appear at most buffer_size - 1 positions earlier than its original position, though it can be delayed arbitrarily long while it remains in the buffer.
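The locality bias can be checked empirically. In the sketch below (`buffer_shuffle` is a hypothetical helper, not the library API), no element is emitted more than buffer_size - 1 positions earlier than its original stream position, because an element only enters the buffer after the slot it fills has been emitted:

```python
import random

def buffer_shuffle(stream, buffer_size, seed=None):
    # Compact sketch of buffer shuffling (hypothetical helper).
    rng, buf = random.Random(seed), []
    for item in stream:
        if len(buf) < buffer_size:
            buf.append(item)          # initial fill, nothing emitted yet
        else:
            idx = rng.randrange(buffer_size)
            yield buf[idx]            # emit a random buffered element
            buf[idx] = item           # refill the vacancy
    rng.shuffle(buf)
    yield from buf                    # drain the tail in random order

n, b = 1000, 50
order = list(buffer_shuffle(range(n), b, seed=1))
pos = {v: i for i, v in enumerate(order)}
# Largest forward move (element emitted earlier than its origin).
max_advance = max(v - pos[v] for v in range(n))
assert max_advance <= b - 1
```

Delays in the other direction are unbounded: an unlucky element can linger in the buffer until the final drain.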

In addition to element-level buffer shuffling, the operation also shuffles the order of data shards (the underlying files or partitions). This shard-level shuffling provides coarse-grained randomization that complements the fine-grained buffer shuffle. However, if the shard order has been fixed by a prior skip or take operation, it is preserved.

Reproducibility is achieved through a seed parameter that controls both the buffer sampling and the shard ordering. This is particularly important in distributed training where all nodes must agree on the shard assignment.
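The effect of the seed can be illustrated in pure Python (again using a hypothetical `buffer_shuffle` helper standing in for the library call): two runs with the same seed yield identical orders, which is what lets every node in a distributed job derive the same shard assignment.

```python
import random

def buffer_shuffle(stream, buffer_size, seed):
    # Compact sketch of seeded buffer shuffling (hypothetical helper).
    rng, buf = random.Random(seed), []
    for item in stream:
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            idx = rng.randrange(buffer_size)
            yield buf[idx]
            buf[idx] = item
    rng.shuffle(buf)
    yield from buf

run_a = list(buffer_shuffle(range(200), 32, seed=42))
run_b = list(buffer_shuffle(range(200), 32, seed=42))
run_c = list(buffer_shuffle(range(200), 32, seed=7))
assert run_a == run_b   # same seed: identical order on every run/node
assert run_a != run_c   # different seed: a different permutation
```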

Usage

Use streaming shuffle when:

  • You are training a model on a streaming dataset and need to avoid ordering bias.
  • You want approximate randomization without loading the full dataset into memory.
  • You need reproducible shuffling across multiple runs or distributed nodes (by fixing the seed).
  • You are combining shuffle with other lazy operations like map, filter, take, and skip.

Theoretical Basis

Buffer-based shuffling is an instance of reservoir sampling adapted for streaming contexts. The classic reservoir sampling algorithm (Vitter, 1985) maintains a fixed-size sample from a stream of unknown length, ensuring each element has an equal probability of being in the sample. The buffer shuffle variant extends this: instead of sampling for inclusion, it samples for the order of emission.

The algorithm is a variation of the Fisher-Yates shuffle applied incrementally. At each step, a random index in the buffer is selected, the element at that index is yielded, and the vacancy is filled by the next stream element. Total memory is therefore O(buffer_size), and each yielded element costs O(1) time.

The two-level approach (shard-level permutation plus element-level buffer) is analogous to block randomization in experimental design: shards provide coarse randomization units, while the buffer provides fine-grained mixing within and across shard boundaries.
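The two-level scheme can be sketched as a single seeded generator: a permutation over shards followed by an element-level buffer. `two_level_shuffle` and its arguments are hypothetical names for illustration, not the library API:

```python
import random

def two_level_shuffle(shards, buffer_size, seed):
    """Shard-level permutation plus element-level buffer shuffle.

    Sketch of the two-level scheme; names are hypothetical.
    """
    rng = random.Random(seed)
    order = list(range(len(shards)))
    rng.shuffle(order)                 # coarse: permute shard order

    def stream():
        for s in order:                # read shards in permuted order
            yield from shards[s]

    buf = []
    for item in stream():              # fine: buffer shuffle across
        if len(buf) < buffer_size:     # shard boundaries
            buf.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buf[idx]
        buf[idx] = item
    rng.shuffle(buf)
    yield from buf

# Five "shards" of ten elements each.
shards = [list(range(i * 10, (i + 1) * 10)) for i in range(5)]
out = list(two_level_shuffle(shards, buffer_size=8, seed=3))
```

Because the buffer spans shard boundaries, elements from the end of one shard mix with the start of the next, softening the block structure that shard permutation alone would leave.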

