Implementation:Huggingface Datatrove ContextShuffler

Knowledge Sources	Huggingface_Datatrove
Domains	Data Processing, Tokenization
Last Updated	2026-02-14 17:00 GMT

Overview

DocumentTokenizerContextShuffler is a pipeline step that shuffles tokenized data at the context window level by randomly permuting fixed-size token windows within a binary `.ds` file.

Description

The DocumentTokenizerContextShuffler class extends PipelineStep to perform context-level shuffling of already-tokenized data. Rather than shuffling entire documents, it operates on fixed-size windows of tokens (2049 by default, i.e. 2048 + 1). At this granularity, consecutive training samples are drawn from different parts of the file instead of being sequential chunks of the same documents.
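The source gives the default as 2048 + 1 without saying why. One common reason, assumed here and not confirmed by the source, is next-token prediction: a 2049-token window yields 2048 inputs and 2048 targets shifted by one position. A minimal sketch of that split (the name `split_window` is hypothetical and not part of datatrove):

```python
import numpy as np


def split_window(window):
    """Split one context window into (inputs, targets) shifted by one token.

    Assumption: the extra token exists so that every input position has a
    next-token target. This is an illustration, not datatrove code.
    """
    window = np.asarray(window)
    return window[:-1], window[1:]


# Stand-in for one 2049-token window read from a shuffled `.ds` file.
window = np.arange(2049, dtype=np.uint16)
inputs, targets = split_window(window)
```

Under this assumption, a model with a 2048-token context consumes exactly one window per training sample.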

The shuffling process has four steps: it reads the `.ds.index` file to determine the total token count, computes the number of complete windows that fit within the data (total tokens divided by window_size, rounded down), generates a random permutation of the window indices, and writes the windows to the output file in that order. Tokens left over after the last complete window do not belong to any window and are therefore not written. The source data file is accessed through memory-mapped I/O (mmap) for efficient random access: window i occupies the bytes from i * window_size * token_size up to, but not including, (i + 1) * window_size * token_size.
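The steps above can be sketched as a standalone function. This is a simplified illustration working on local files, not the library's code: the name `shuffle_windows` is hypothetical, and the total token count is passed in directly instead of being read from the `.ds.index` file.

```python
import mmap

import numpy as np


def shuffle_windows(src_path, dst_path, total_tokens,
                    window_size=2049, token_size=2, seed=None):
    """Write the complete windows of src_path to dst_path in random order.

    Returns the permutation that was applied. Hypothetical helper for
    illustration only.
    """
    window_bytes = window_size * token_size
    n_windows = total_tokens // window_size  # a trailing partial window is dropped
    if n_windows == 0:
        # Nothing to shuffle; create an empty output file.
        open(dst_path, "wb").close()
        return np.empty(0, dtype=np.int64)

    order = np.random.default_rng(seed).permutation(n_windows)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Memory-map the source so each window is a cheap random-access slice.
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for idx in order:
                start = int(idx) * window_bytes
                dst.write(mm[start:start + window_bytes])
    return order
```

With the same seed and the same input, the function produces the same window order, which is what makes a shuffle reproducible.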

The class supports configurable seed values for reproducible shuffling and configurable token_size (defaulting to 2 bytes per token). It is designed for distributed execution, using rank and world_size to shard input files across workers.
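How input files are divided between workers is not spelled out here. A minimal sketch, assuming a simple strided split over a sorted file list (the helper `shard_files` is hypothetical and only illustrates the idea of rank-based sharding):

```python
def shard_files(files, rank, world_size):
    """Return the files assigned to worker `rank` out of `world_size` workers.

    Assumption: a strided split, where worker r takes every world_size-th
    file starting at position r in the sorted list.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return sorted(files)[rank::world_size]


files = ["00002.ds", "00000.ds", "00003.ds", "00001.ds", "00004.ds"]
shards = [shard_files(files, rank, world_size=2) for rank in range(2)]
```

Under this scheme every file is assigned to exactly one worker, so each `.ds` file is shuffled independently and no file is processed twice.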

Usage

Use this step after document tokenization to shuffle the resulting token files at the context window level before training. This is particularly useful for language model pre-training where shuffling at the context level helps reduce correlations between consecutive training batches.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/tokens/context_shuffler.py
Lines: 1-83

Signature

class DocumentTokenizerContextShuffler(PipelineStep):
    def __init__(
        self,
        input_folder: DataFolderLike,
        output_folder: DataFolderLike,
        window_size: int = 2048 + 1,
        seed: int = None,
        token_size: int = 2,
    ):

Import

from datatrove.pipeline.tokens.context_shuffler import DocumentTokenizerContextShuffler

I/O Contract

Inputs

Name	Type	Required	Description
input_folder	DataFolderLike	Yes	The input folder containing tokenized `.ds` and `.ds.index` files
output_folder	DataFolderLike	Yes	The output folder to write the shuffled `.ds` files to
window_size	int	No	The number of tokens per context window (default: 2049)
seed	int	No	Seed for the random number generator for reproducible shuffling (default: None)
token_size	int	No	Size of each token in bytes (default: 2)

Outputs

Name	Type	Description
output files	.ds binary files	Shuffled binary token files written to the output_folder with windows in random order

Usage Examples

Basic Usage

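from datatrove.pipeline.tokens.context_shuffler import DocumentTokenizerContextShuffler

# Build the step; it is meant to run in a pipeline after document tokenization.
shuffler = DocumentTokenizerContextShuffler(
    input_folder="s3://my-bucket/tokenized/",  # holds the `.ds` and `.ds.index` files
    output_folder="s3://my-bucket/shuffled/",  # shuffled `.ds` files are written here
    window_size=2049,  # tokens per context window (2048 + 1)
    seed=42,  # fixed seed for a reproducible window order
)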

Related Pages

Principle:Huggingface_Datatrove_Context_Window_Shuffling
