
Principle:Huggingface Datatrove Token File Merging

From Leeroopedia
Sources: Huggingface Datatrove
Domains: Tokenization, Training_Data
Last Updated: 2026-02-14

Overview

Merging and shuffling distributed tokenized files into consolidated training-ready binary files.

Description

After distributed tokenization produces many small .ds files (one per worker rank), merging combines them into larger files suitable for training. The merger reads document boundary indexes from all input files, generates a global shuffled ordering across files, and copies token bytes from the input files into consolidated output files following that ordering.
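The core of this process can be modeled in a few lines of Python. The sketch below is illustrative only, not datatrove's implementation: plain lists stand in for .ds files and their document boundary indexes, and the function name merge_shuffled is hypothetical.

```python
import random

def merge_shuffled(input_files, seed=0):
    """Globally shuffle documents across input files and return the
    merged token stream.

    input_files: list of files, each a list of documents, each document
    a list of token ids (stands in for a .ds file plus its index).
    """
    # Build a global list of (file_index, doc_index) pairs from every
    # input file's document boundary index.
    order = [(f, d)
             for f, docs in enumerate(input_files)
             for d in range(len(docs))]
    # Generate one shuffled ordering across all files.
    random.Random(seed).shuffle(order)
    # Copy each document's token span from its input file into the
    # consolidated output, following the shuffled ordering.
    merged = []
    for f, d in order:
        merged.extend(input_files[f][d])
    return merged

# Three "worker" files with 2, 1, and 2 documents respectively.
files = [[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]]
out = merge_shuffled(files, seed=42)
```

Note that shuffling happens at document granularity: each document's tokens stay contiguous in the output, only the order of documents changes.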

Key properties of the merging process:

  • Cross-file document shuffling -- documents from different input files are interleaved in a random order to ensure training data is well-mixed across the sources that different workers processed
  • Configurable output file size -- the max_tokens_per_file parameter controls when to start a new output file, allowing production of training-sized shards
  • Total token limits -- the max_tokens parameter can cap the total tokens processed, useful for creating fixed-size training sets
  • Chunk-level shuffling -- when shuffle_chunk_size is set, instead of shuffling individual documents, fixed-size token chunks are shuffled, which can be more efficient for very large datasets
  • Metadata preservation -- tokenizer name and token size are read from the first input file's metadata and propagated to output files
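Chunk-level shuffling can be modeled the same way: instead of permuting documents, the token stream is cut into consecutive fixed-size chunks, and the chunks are permuted. The sketch below is a self-contained illustration of that idea, not the datatrove implementation:

```python
import random

def shuffle_chunks(tokens, shuffle_chunk_size, seed=0):
    """Shuffle fixed-size token chunks instead of individual documents."""
    # Cut the stream into consecutive chunks of shuffle_chunk_size
    # tokens (the final chunk may be shorter).
    chunks = [tokens[i:i + shuffle_chunk_size]
              for i in range(0, len(tokens), shuffle_chunk_size)]
    # Permute whole chunks, then flatten back into one stream.
    random.Random(seed).shuffle(chunks)
    return [t for chunk in chunks for t in chunk]

stream = list(range(10))
shuffled = shuffle_chunks(stream, shuffle_chunk_size=2, seed=0)
```

Because only chunk order changes, the shuffle needs one random permutation over len(tokens) / shuffle_chunk_size items rather than one over every document, which is why it scales better for very large datasets.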

Usage

This runs as the second stage of a two-stage tokenization pipeline, after the distributed DocumentTokenizer runs complete. It must run with world_size=1 (a single task), since it consolidates the outputs of all workers into one globally shuffled set of files.
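The two token limits interact simply: documents are written in shuffled order, a new output file is started whenever the current one would overflow max_tokens_per_file, and writing stops once max_tokens has been consumed. A self-contained sketch of that logic (illustrative, not the datatrove implementation; shard_documents is a hypothetical name):

```python
def shard_documents(docs, max_tokens_per_file, max_tokens=None):
    """Split an (already shuffled) document sequence into output shards.

    docs: list of documents, each a list of token ids.
    Returns a list of shards, each a list of documents.
    """
    shards, current, current_tokens, total = [], [], 0, 0
    for doc in docs:
        # Stop once the total token cap is reached (a soft cap: the
        # document that crosses it is still written in full).
        if max_tokens is not None and total >= max_tokens:
            break
        # Start a new output file when this one would overflow.
        if current and current_tokens + len(doc) > max_tokens_per_file:
            shards.append(current)
            current, current_tokens = [], 0
        current.append(doc)
        current_tokens += len(doc)
        total += len(doc)
    if current:
        shards.append(current)
    return shards

docs = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
shards = shard_documents(docs, max_tokens_per_file=5)
```

With a 5-token cap per file, the four documents above land in two shards of 5 tokens each; setting max_tokens would additionally truncate the sequence of shards to a fixed-size training set.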

Theoretical Basis

Cross-file shuffling addresses the correlation problem that arises when workers process data in order: documents from similar sources or time periods cluster together within each worker's output. By globally shuffling documents across all worker outputs, the merged files provide better training data mixing, which is important for stable language model training and avoiding periodic loss spikes.
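The effect is easy to see with two simulated worker outputs, each homogeneous in source: plain concatenation yields one long run per source, while a global shuffle interleaves them. The source labels below are invented for the illustration:

```python
import random

# Two worker outputs, each clustered by source (e.g. one crawl
# snapshot per worker). Concatenation keeps the clusters intact.
worker_a = ["news"] * 500
worker_b = ["code"] * 500
concatenated = worker_a + worker_b

def longest_run(seq):
    """Length of the longest run of identical consecutive items."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

# A global shuffle across both workers breaks up the runs, so a
# training batch drawn sequentially sees a mix of both sources.
mixed = concatenated.copy()
random.Random(0).shuffle(mixed)
```

Sequential reads of the concatenated stream would train on 500 documents of one source before seeing the other, exactly the correlation that produces periodic loss spikes; the shuffled stream has no such long runs.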
