Principle: Huggingface Datatrove Document Tokenization
| Sources | Domains | Last Updated |
|---|---|---|
| Huggingface Datatrove, BPE Paper, HuggingFace Tokenizers | Tokenization, Training_Data | 2026-02-14 |
Overview
Converts raw text documents into sequences of token IDs, serialized in a binary format for efficient language model training.
Description
Document tokenization converts raw text into integer token ID sequences using a pre-trained tokenizer (e.g., GPT-2 BPE). The process operates in batches for throughput and produces binary .ds files containing contiguous token arrays and .ds.index files tracking document boundaries (end positions in tokens). This binary format enables zero-copy or near-zero-copy loading during training, avoiding repeated tokenization.
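The binary layout described above can be sketched with the standard library alone. This is a minimal illustration, not Datatrove's actual serialization code: it assumes 2-byte little-endian token IDs in the `.ds` file and 8-byte cumulative end positions (in tokens) in the `.ds.index` file; the real dtypes and file naming may differ.

```python
import struct

# Hypothetical token-ID sequences for three already-tokenized documents.
docs = [[15496, 995], [464, 2068, 7586, 21831], [13]]

token_bytes = bytearray()
index_bytes = bytearray()
end = 0  # running end position, measured in tokens
for doc in docs:
    for tok in doc:
        token_bytes += struct.pack("<H", tok)  # 2-byte little-endian token ID
    end += len(doc)
    index_bytes += struct.pack("<Q", end)      # document boundary: end offset in tokens

with open("shard_0.ds", "wb") as f:
    f.write(token_bytes)       # contiguous token array
with open("shard_0.ds.index", "wb") as f:
    f.write(index_bytes)       # one uint64 end position per document
```

Because the token stream is a flat, fixed-width array, a training loader can memory-map the `.ds` file and slice documents directly from the index, with no per-epoch re-tokenization.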
Key capabilities include:
- EOS token insertion -- optionally appends an end-of-sequence token after each document via post-processing
- Within-file document shuffling -- randomizes document order within each output file to improve training data mixing
- Chunked shuffling -- groups documents into fixed-size token chunks and shuffles at the chunk level for large datasets where per-document shuffling is too fine-grained
- Loss masking -- optionally produces .ds.loss files with per-token binary masks to exclude certain text regions (e.g., prompts) from loss computation
- Metadata files -- saves tokenizer name and total token count for downstream verification
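The first two capabilities compose naturally before serialization. The sketch below shows EOS insertion followed by within-file document shuffling on toy data; the EOS ID 50256 (GPT-2's `<|endoftext|>`) and the ordering of the steps are illustrative assumptions, not Datatrove's internals.

```python
import random

EOS_ID = 50256  # GPT-2's <|endoftext|>; an assumed value for illustration

docs = [[10, 11], [20, 21, 22], [30]]

# 1. EOS insertion: append the end-of-sequence token to every document.
docs = [doc + [EOS_ID] for doc in docs]

# 2. Within-file shuffling: randomize document order before writing, so
#    consecutive training samples are not in corpus order.
rng = random.Random(0)
rng.shuffle(docs)

# 3. Flatten into one contiguous token stream plus a boundary index.
tokens, boundaries, end = [], [], 0
for doc in docs:
    tokens.extend(doc)
    end += len(doc)
    boundaries.append(end)
```

After this step, every document boundary in the index points one past an EOS token, which is what allows a loader to split samples cleanly.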
Usage
Apply this stage when preparing text data for language model pre-training, after all filtering and deduplication steps have run. It is typically the first stage of a two-stage tokenization pipeline, followed by DocumentTokenizerMerger, which consolidates the distributed per-worker outputs.
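The two-stage shape can be sketched without the library: stage 1 writes one `.ds`/`.ds.index` pair per worker, and the merge stage concatenates the token streams while rebasing each shard's document boundaries by the running token offset. File names, dtypes, and the merge strategy here are assumptions for illustration, not DocumentTokenizerMerger's implementation.

```python
import struct

def write_shard(path, docs):
    """Stage 1 sketch: write one worker's .ds / .ds.index pair
    (assumed uint16 tokens, uint64 end positions)."""
    end = 0
    with open(path + ".ds", "wb") as ds, open(path + ".ds.index", "wb") as idx:
        for doc in docs:
            ds.write(struct.pack(f"<{len(doc)}H", *doc))
            end += len(doc)
            idx.write(struct.pack("<Q", end))

def merge_shards(shard_paths, out_path):
    """Stage 2 sketch: concatenate token streams and rebase boundaries."""
    offset = 0  # running token offset of the current shard in the merged file
    with open(out_path + ".ds", "wb") as ds, open(out_path + ".ds.index", "wb") as idx:
        for path in shard_paths:
            with open(path + ".ds", "rb") as f:
                data = f.read()
            ds.write(data)
            with open(path + ".ds.index", "rb") as f:
                raw = f.read()
            for (end,) in struct.iter_unpack("<Q", raw):
                idx.write(struct.pack("<Q", end + offset))
            offset += len(data) // 2  # shard length in tokens (2 bytes each)

write_shard("worker_0", [[1, 2], [3]])
write_shard("worker_1", [[4, 5, 6]])
merge_shards(["worker_0", "worker_1"], "merged")
```

Rebasing the index is the key step: boundaries are stored relative to the start of each file, so each shard's end positions must be shifted by the total token count of all shards already written.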
Theoretical Basis
Subword tokenization (BPE/Unigram) maps variable-length text spans to fixed integer IDs, enabling neural language models to handle open vocabularies. Binary serialization of token sequences eliminates the need for repeated tokenization at training time. The per-token byte width (2 bytes when every token ID fits in an unsigned 16-bit integer, i.e., a maximum ID of 65535; 4 bytes otherwise) is determined automatically from the tokenizer's vocabulary size using numpy integer type bounds.
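The dtype-selection rule can be written down directly. This is a sketch of the described logic using `numpy.iinfo` bounds, not Datatrove's exact code; the function name is hypothetical.

```python
import numpy as np

def token_dtype(vocab_size: int) -> type:
    """Pick the smallest unsigned integer dtype that can hold every token ID.
    The largest ID is vocab_size - 1, so uint16 (max 65535) suffices for
    vocabularies of up to 65536 tokens; larger vocabularies need uint32."""
    if vocab_size - 1 <= np.iinfo(np.uint16).max:
        return np.uint16
    return np.uint32
```

For example, GPT-2's 50257-token vocabulary fits in 2 bytes per token, halving on-disk size relative to a naive 4-byte encoding.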