Implementation:Huggingface Datatrove ContextShuffler

From Leeroopedia
Knowledge Sources
Domains Data Processing, Tokenization
Last Updated 2026-02-14 17:00 GMT

Overview

DocumentTokenizerContextShuffler is a pipeline step that shuffles tokenized data at the context window level by randomly permuting fixed-size token windows within a binary `.ds` file.

Description

The DocumentTokenizerContextShuffler class extends PipelineStep to perform context-level shuffling of already-tokenized data. Rather than shuffling entire documents, it operates on fixed-size windows of tokens. The default window is 2049 tokens (2048 + 1), which presumably corresponds to a 2048-token context plus one extra token for the shifted next-token targets. Because windows rather than documents are permuted, consecutive training samples are less likely to be sequential chunks of the same document.

The shuffling process has four steps: it reads a `.ds.index` file to determine the total token count, computes how many complete windows fit within the data (the total token count divided by window_size, rounded down), generates a random permutation of the window indices, and writes the windows to the output in that shuffled order. Tokens left over after the last complete window do not belong to any window. The source data file is accessed through memory-mapped file I/O (mmap) for efficient random access: window i occupies the byte range from i × window_size × token_size up to, but not including, (i + 1) × window_size × token_size.
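The steps above can be sketched with the standard library alone. This is a minimal illustration and not the library's code: the function name `shuffle_windows` is hypothetical, the total token count is passed in directly instead of being read from a `.ds.index` file, and Python's `random` module stands in for whatever random number generator the step actually uses.

```python
import mmap
import random


def shuffle_windows(src_path, dst_path, total_tokens,
                    window_size=2049, token_size=2, seed=None):
    """Copy the complete windows of src_path to dst_path in random order.

    Hypothetical helper: total_tokens is supplied by the caller here,
    whereas the pipeline step derives it from the `.ds.index` file.
    Returns the permutation of window indices that was applied.
    """
    window_bytes = window_size * token_size
    # Only complete windows are shuffled; a trailing partial window is skipped.
    n_windows = total_tokens // window_size

    order = list(range(n_windows))
    random.Random(seed).shuffle(order)

    if n_windows == 0:
        # Nothing to copy; also avoids memory-mapping a possibly empty file.
        open(dst_path, "wb").close()
        return order

    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as data:
            for i in order:
                start = i * window_bytes  # byte offset of window i
                dst.write(data[start:start + window_bytes])
    return order
```

Because each window is addressed by a computed byte offset, the source file is never loaded into memory as a whole; only the slices being copied are touched.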

The class supports configurable seed values for reproducible shuffling and configurable token_size (defaulting to 2 bytes per token). It is designed for distributed execution, using rank and world_size to shard input files across workers.
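As a rough illustration of these two ideas, the sketch below shows one way files could be split across workers and how a fixed seed yields the same window order on every run. Both helpers are hypothetical: the round-robin assignment is an assumption made for the example (the actual assignment is decided by the library's folder sharding), and `random.Random` is a stand-in for the step's own generator.

```python
import random


def shard_files(files, rank, world_size):
    """Return the files handled by worker `rank` out of `world_size` workers.

    Assumed scheme: sort the file names, then assign them round-robin.
    """
    return sorted(files)[rank::world_size]


def window_order(n_windows, seed):
    """Return a permutation of range(n_windows).

    The same non-None seed always produces the same permutation;
    seed=None gives a different order on each call.
    """
    order = list(range(n_windows))
    random.Random(seed).shuffle(order)
    return order
```

With a scheme like this, every file is processed by exactly one worker, so workers can write their shuffled outputs independently.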

Usage

Use this step after document tokenization to shuffle the resulting token files at the context window level before training. This is particularly useful for language model pre-training where shuffling at the context level helps reduce correlations between consecutive training batches.

Code Reference

Source Location

Signature

class DocumentTokenizerContextShuffler(PipelineStep):
    def __init__(
        self,
        input_folder: DataFolderLike,
        output_folder: DataFolderLike,
        window_size: int = 2048 + 1,
        seed: int = None,
        token_size: int = 2,
    ):

Import

from datatrove.pipeline.tokens.context_shuffler import DocumentTokenizerContextShuffler

I/O Contract

Inputs

Name | Type | Required | Description
input_folder | DataFolderLike | Yes | The input folder containing tokenized `.ds` and `.ds.index` files
output_folder | DataFolderLike | Yes | The output folder to write the shuffled `.ds` files to
window_size | int | No | The number of tokens per context window (default: 2049)
seed | int | No | Seed for the random number generator for reproducible shuffling (default: None)
token_size | int | No | Size of each token in bytes (default: 2)

Outputs

Name | Type | Description
output files | `.ds` binary files | Shuffled binary token files written to the output_folder with windows in random order
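To inspect a shuffled file, a single window can be decoded back into token ids. The sketch below is an illustration under stated assumptions: the helper `read_window` is hypothetical, tokens are assumed to be stored as little-endian unsigned integers, and only token sizes of 2 and 4 bytes are handled.

```python
import struct

# Assumed mapping from token_size (bytes per token) to unsigned struct codes.
_FORMATS = {2: "H", 4: "I"}


def read_window(path, index, window_size=2049, token_size=2):
    """Return window `index` of a token file as a list of ints.

    Assumes little-endian unsigned tokens; raises IndexError if the
    file does not contain a complete window at that position.
    """
    window_bytes = window_size * token_size
    with open(path, "rb") as f:
        f.seek(index * window_bytes)
        raw = f.read(window_bytes)
    if len(raw) != window_bytes:
        raise IndexError(f"window {index} is out of range")
    return list(struct.unpack(f"<{window_size}{_FORMATS[token_size]}", raw))
```

With the default 2049-token window, a training loop would commonly use the first 2048 tokens of a window as inputs and the last 2048 as next-token targets, although that convention belongs to the consumer of the file and is not something this step enforces.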

Usage Examples

Basic Usage

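from datatrove.pipeline.tokens.context_shuffler import DocumentTokenizerContextShuffler

# Shuffle 2049-token windows of the tokenized files found in the input folder.
shuffler = DocumentTokenizerContextShuffler(
    input_folder="s3://my-bucket/tokenized/",  # holds the `.ds` and `.ds.index` files
    output_folder="s3://my-bucket/shuffled/",  # receives the shuffled `.ds` files
    window_size=2049,  # tokens per context window (the default)
    seed=42,  # fixed seed for reproducible shuffling
)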
