Implementation:Huggingface Datatrove ContextShuffler
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Tokenization |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
DocumentTokenizerContextShuffler is a pipeline step that shuffles tokenized data at the context window level by randomly permuting fixed-size token windows within a binary `.ds` file.
Description
The `DocumentTokenizerContextShuffler` class extends `PipelineStep` to shuffle already-tokenized data at the context level. Rather than shuffling whole documents, it operates on fixed-size windows of tokens (2049 by default, written as 2048 + 1 in the signature). Shuffling at this granularity means that neighbouring windows in the output are unlikely to come from the same stretch of the source file, so training batches read sequentially see a more varied mix of contexts instead of consecutive chunks of the same documents.
For each `.ds` file, the step reads the matching `.ds.index` file to get the total token count, computes how many complete windows fit in the data (the token count integer-divided by the window size), draws a random permutation of the window indices, and writes the windows to the output in that order. Because only complete windows are written, tokens left over after the last complete window do not appear in the output. The source file is read through memory-mapped I/O (mmap) for efficient random access: window `i` occupies the bytes from `i * window_size * token_size` up to `(i + 1) * window_size * token_size`.
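The steps above can be sketched in a few lines of standalone Python. This is a minimal illustration under stated assumptions, not the library's code: `shuffle_windows` is a hypothetical helper, it works on a local file only, and the total token count is passed in directly instead of being read from a `.ds.index` file.

```python
import mmap
import os
import tempfile

import numpy as np


def shuffle_windows(src_path, dst_path, total_tokens, window_size=2049, token_size=2, seed=None):
    """Hypothetical helper: write the complete windows of src_path in random order."""
    rng = np.random.default_rng(seed)
    nr_windows = total_tokens // window_size  # a trailing partial window is dropped
    ordering = rng.permutation(nr_windows)
    window_bytes = window_size * token_size
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # mmap gives random access without loading the whole file into memory
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as data:
            for window_index in ordering:
                start = int(window_index) * window_bytes
                dst.write(data[start : start + window_bytes])
    return nr_windows


# Tiny demonstration: 10 tokens with 3-token windows gives 3 windows; 1 token is dropped.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "tokens.ds")
dst_path = os.path.join(tmp_dir, "shuffled.ds")
np.arange(10, dtype=np.uint16).tofile(src_path)

written = shuffle_windows(src_path, dst_path, total_tokens=10, window_size=3, seed=0)
shuffled = np.fromfile(dst_path, dtype=np.uint16).reshape(written, 3)
```

Each row of `shuffled` is one intact window from the source, so token order inside a window is preserved and only the order of the windows changes.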
The class accepts a `seed` for reproducible shuffling and a configurable `token_size` (2 bytes per token by default). It is designed for distributed execution: `rank` and `world_size` are used to shard the input files across workers.
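To make the two ideas in this paragraph concrete, the sketch below shows a seeded permutation and a strided split of files across workers. Both helpers are hypothetical: the use of NumPy's `default_rng` and the `files[rank::world_size]` split are assumptions made for illustration, and the library's actual sharding logic may differ.

```python
import numpy as np


def window_order(nr_windows, seed):
    """Return a permutation of window indices drawn from a seeded generator."""
    return np.random.default_rng(seed).permutation(nr_windows)


def shard(files, rank, world_size):
    """Hypothetical strided split: worker `rank` takes every world_size-th file."""
    return sorted(files)[rank::world_size]


# The same seed yields the same window order on every run.
same_a = window_order(8, seed=42)
same_b = window_order(8, seed=42)

# Five illustrative file names split across two workers.
files = ["00000.ds", "00001.ds", "00002.ds", "00003.ds", "00004.ds"]
shards = [shard(files, rank, world_size=2) for rank in range(2)]
```

A fixed seed only reproduces the same output if the rest of the setup (input files, window size, and the assignment of files to workers) is unchanged as well.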
Usage
Use this step after document tokenization to shuffle the resulting token files at the context window level before training. This is particularly useful for language model pre-training where shuffling at the context level helps reduce correlations between consecutive training batches.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/tokens/context_shuffler.py
- Lines: 1-83
Signature
```python
class DocumentTokenizerContextShuffler(PipelineStep):
    def __init__(
        self,
        input_folder: DataFolderLike,
        output_folder: DataFolderLike,
        window_size: int = 2048 + 1,
        seed: int = None,
        token_size: int = 2,
    ):
```
Import
```python
from datatrove.pipeline.tokens.context_shuffler import DocumentTokenizerContextShuffler
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_folder | DataFolderLike | Yes | The input folder containing tokenized `.ds` and `.ds.index` files |
| output_folder | DataFolderLike | Yes | The output folder to write the shuffled `.ds` files to |
| window_size | int | No | The number of tokens per context window (default: 2049) |
| seed | int | No | Seed for the random number generator for reproducible shuffling (default: None) |
| token_size | int | No | Size of each token in bytes (default: 2) |
Outputs
| Name | Type | Description |
|---|---|---|
| output files | .ds binary files | Shuffled binary token files written to the output_folder with windows in random order |
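Since the output is a flat sequence of fixed-size windows, it can be viewed as a 2-D array with one row per context window. The sketch below assumes that `token_size=2` corresponds to unsigned 16-bit token ids and uses a synthetic stand-in file with a hypothetical name; splitting each 2049-token window into 2048 inputs and 2048 shifted targets is a common convention assumed here, not something the source states.

```python
import os
import tempfile

import numpy as np

WINDOW_SIZE = 2049  # default window_size from the signature
TOKEN_DTYPE = np.uint16  # assumption: token_size=2 means unsigned 16-bit tokens

# Stand-in for a shuffled output file: 4 windows of fake token ids.
path = os.path.join(tempfile.mkdtemp(), "example_shuffled.ds")  # hypothetical name
fake = (np.arange(4 * WINDOW_SIZE) % 50000).astype(TOKEN_DTYPE)
fake.tofile(path)

# Memory-map the file and view it as one row per context window.
tokens = np.memmap(path, dtype=TOKEN_DTYPE, mode="r")
windows = tokens.reshape(-1, WINDOW_SIZE)

# Assumed convention: the extra token lets every window yield
# 2048 input tokens and 2048 next-token targets.
inputs, targets = windows[:, :-1], windows[:, 1:]
```

The reshape works without any remainder handling because the file length is always a whole number of windows.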
Usage Examples
Basic Usage
```python
from datatrove.pipeline.tokens.context_shuffler import DocumentTokenizerContextShuffler

shuffler = DocumentTokenizerContextShuffler(
    input_folder="s3://my-bucket/tokenized/",
    output_folder="s3://my-bucket/shuffled/",
    window_size=2049,
    seed=42,
)
```