
Principle:Huggingface Datatrove Token File Merging

From Leeroopedia
Sources: Huggingface Datatrove
Domains: Tokenization, Training_Data
Last Updated: 2026-02-14

Overview

Merging and shuffling distributed tokenized files into consolidated training-ready binary files.

Description

After distributed tokenization produces many small .ds files (one per worker rank), merging combines them into larger files suitable for training. The merger reads document boundary indexes from all input files, generates a global shuffled ordering across files, and copies token bytes from the input files into consolidated output files following that ordering.
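The core of this process can be modeled in a few lines of Python. The sketch below is illustrative only, not datatrove's implementation: plain lists stand in for .ds files and their document boundary indexes, and the function name merge_shuffled is hypothetical.

```python
import random

def merge_shuffled(input_files, seed=0):
    """Globally shuffle documents across input files and return the
    merged token stream.

    input_files: list of files, each a list of documents, each document
    a list of token ids (stands in for a .ds file plus its index).
    """
    # Build a global list of (file_index, doc_index) pairs from every
    # input file's document boundary index.
    order = [(f, d)
             for f, docs in enumerate(input_files)
             for d in range(len(docs))]
    # Generate one shuffled ordering across all files.
    random.Random(seed).shuffle(order)
    # Copy each document's token span from its input file into the
    # consolidated output, following the shuffled ordering.
    merged = []
    for f, d in order:
        merged.extend(input_files[f][d])
    return merged

# Three "worker" files with 2, 1, and 2 documents respectively.
files = [[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]]
out = merge_shuffled(files, seed=42)
```

Note that shuffling happens at document granularity: each document's tokens stay contiguous in the output, only the order of documents changes.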

Key properties of the merging process:

  • Cross-file document shuffling -- documents from different input files are interleaved in a random order to ensure training data is well-mixed across the sources that different workers processed
  • Configurable output file size -- the max_tokens_per_file parameter controls when to start a new output file, allowing production of training-sized shards
  • Total token limits -- the max_tokens parameter can cap the total tokens processed, useful for creating fixed-size training sets
  • Chunk-level shuffling -- when shuffle_chunk_size is set, instead of shuffling individual documents, fixed-size token chunks are shuffled, which can be more efficient for very large datasets
  • Metadata preservation -- tokenizer name and token size are read from the first input file's metadata and propagated to output files
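Chunk-level shuffling can be modeled the same way: instead of permuting documents, the token stream is cut into consecutive fixed-size chunks, and the chunks are permuted. The sketch below is a self-contained illustration of that idea, not the datatrove implementation:

```python
import random

def shuffle_chunks(tokens, shuffle_chunk_size, seed=0):
    """Shuffle fixed-size token chunks instead of individual documents."""
    # Cut the stream into consecutive chunks of shuffle_chunk_size
    # tokens (the final chunk may be shorter).
    chunks = [tokens[i:i + shuffle_chunk_size]
              for i in range(0, len(tokens), shuffle_chunk_size)]
    # Permute whole chunks, then flatten back into one stream.
    random.Random(seed).shuffle(chunks)
    return [t for chunk in chunks for t in chunk]

stream = list(range(10))
shuffled = shuffle_chunks(stream, shuffle_chunk_size=2, seed=0)
```

Because only chunk order changes, the shuffle needs one random permutation over len(tokens) / shuffle_chunk_size items rather than one over every document, which is why it scales better for very large datasets.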

Usage

This runs as the second stage of a two-stage tokenization pipeline, after the distributed DocumentTokenizer runs complete. It must run with world_size=1 (a single task), since it consolidates the outputs of all workers into one globally shuffled set of files.
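The two token limits interact simply: documents are written in shuffled order, a new output file is started whenever the current one would overflow max_tokens_per_file, and writing stops once max_tokens has been consumed. A self-contained sketch of that logic (illustrative, not the datatrove implementation; shard_documents is a hypothetical name):

```python
def shard_documents(docs, max_tokens_per_file, max_tokens=None):
    """Split an (already shuffled) document sequence into output shards.

    docs: list of documents, each a list of token ids.
    Returns a list of shards, each a list of documents.
    """
    shards, current, current_tokens, total = [], [], 0, 0
    for doc in docs:
        # Stop once the total token cap is reached (a soft cap: the
        # document that crosses it is still written in full).
        if max_tokens is not None and total >= max_tokens:
            break
        # Start a new output file when this one would overflow.
        if current and current_tokens + len(doc) > max_tokens_per_file:
            shards.append(current)
            current, current_tokens = [], 0
        current.append(doc)
        current_tokens += len(doc)
        total += len(doc)
    if current:
        shards.append(current)
    return shards

docs = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
shards = shard_documents(docs, max_tokens_per_file=5)
```

With a 5-token cap per file, the four documents above land in two shards of 5 tokens each; setting max_tokens would additionally truncate the sequence of shards to a fixed-size training set.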

Theoretical Basis

Cross-file shuffling addresses the correlation problem that arises when workers process data in order: documents from similar sources or time periods cluster together within each worker's output. By globally shuffling documents across all worker outputs, the merged files provide better training data mixing, which is important for stable language model training and avoiding periodic loss spikes.
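The effect is easy to see with two simulated worker outputs, each homogeneous in source: plain concatenation yields one long run per source, while a global shuffle interleaves them. The source labels below are invented for the illustration:

```python
import random

# Two worker outputs, each clustered by source (e.g. one crawl
# snapshot per worker). Concatenation keeps the clusters intact.
worker_a = ["news"] * 500
worker_b = ["code"] * 500
concatenated = worker_a + worker_b

def longest_run(seq):
    """Length of the longest run of identical consecutive items."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

# A global shuffle across both workers breaks up the runs, so a
# training batch drawn sequentially sees a mix of both sources.
mixed = concatenated.copy()
random.Random(0).shuffle(mixed)
```

Sequential reads of the concatenated stream would train on 500 documents of one source before seeing the other, exactly the correlation that produces periodic loss spikes; the shuffled stream has no such long runs.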
