Principle: Huggingface Datatrove Document Tokenization
| Sources | Domains | Last Updated |
|---|---|---|
| Huggingface Datatrove, BPE Paper, HuggingFace Tokenizers | Tokenization, Training_Data | 2026-02-14 |
Overview
Converts raw text documents into sequences of token IDs, serialized in a binary format for efficient language model training.
Description
Document tokenization converts raw text into integer token ID sequences using a pre-trained tokenizer (e.g., GPT-2 BPE). The process operates in batches for throughput and produces binary .ds files containing contiguous token arrays and .ds.index files tracking document boundaries (end positions in tokens). This binary format enables zero-copy or near-zero-copy loading during training, avoiding repeated tokenization.
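The binary layout described above can be sketched with the standard library alone. This is a minimal illustration, not Datatrove's actual serialization code: it assumes 2-byte little-endian token IDs in the `.ds` file and 8-byte cumulative end positions (in tokens) in the `.ds.index` file; the real dtypes and file naming may differ.

```python
import struct

# Hypothetical token-ID sequences for three already-tokenized documents.
docs = [[15496, 995], [464, 2068, 7586, 21831], [13]]

token_bytes = bytearray()
index_bytes = bytearray()
end = 0  # running end position, measured in tokens
for doc in docs:
    for tok in doc:
        token_bytes += struct.pack("<H", tok)  # 2-byte little-endian token ID
    end += len(doc)
    index_bytes += struct.pack("<Q", end)      # document boundary: end offset in tokens

with open("shard_0.ds", "wb") as f:
    f.write(token_bytes)       # contiguous token array
with open("shard_0.ds.index", "wb") as f:
    f.write(index_bytes)       # one uint64 end position per document
```

Because the token stream is a flat, fixed-width array, a training loader can memory-map the `.ds` file and slice documents directly from the index, with no per-epoch re-tokenization.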
Key capabilities include:
- EOS token insertion -- optionally appends an end-of-sequence token after each document via post-processing
- Within-file document shuffling -- randomizes document order within each output file to improve training data mixing
- Chunked shuffling -- groups documents into fixed-size token chunks and shuffles at the chunk level for large datasets where per-document shuffling is too fine-grained
- Loss masking -- optionally produces .ds.loss files with per-token binary masks to exclude certain text regions (e.g., prompts) from loss computation
- Metadata files -- saves tokenizer name and total token count for downstream verification
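The first two capabilities compose naturally before serialization. The sketch below shows EOS insertion followed by within-file document shuffling on toy data; the EOS ID 50256 (GPT-2's `<|endoftext|>`) and the ordering of the steps are illustrative assumptions, not Datatrove's internals.

```python
import random

EOS_ID = 50256  # GPT-2's <|endoftext|>; an assumed value for illustration

docs = [[10, 11], [20, 21, 22], [30]]

# 1. EOS insertion: append the end-of-sequence token to every document.
docs = [doc + [EOS_ID] for doc in docs]

# 2. Within-file shuffling: randomize document order before writing, so
#    consecutive training samples are not in corpus order.
rng = random.Random(0)
rng.shuffle(docs)

# 3. Flatten into one contiguous token stream plus a boundary index.
tokens, boundaries, end = [], [], 0
for doc in docs:
    tokens.extend(doc)
    end += len(doc)
    boundaries.append(end)
```

After this step, every document boundary in the index points one past an EOS token, which is what allows a loader to split samples cleanly.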
Usage
Apply this stage when preparing text data for language model pre-training, after all filtering and deduplication steps have run. It is typically the first stage of a two-stage tokenization pipeline, followed by DocumentTokenizerMerger, which consolidates the distributed per-worker outputs.
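The two-stage shape can be sketched without the library: stage 1 writes one `.ds`/`.ds.index` pair per worker, and the merge stage concatenates the token streams while rebasing each shard's document boundaries by the running token offset. File names, dtypes, and the merge strategy here are assumptions for illustration, not DocumentTokenizerMerger's implementation.

```python
import struct

def write_shard(path, docs):
    """Stage 1 sketch: write one worker's .ds / .ds.index pair
    (assumed uint16 tokens, uint64 end positions)."""
    end = 0
    with open(path + ".ds", "wb") as ds, open(path + ".ds.index", "wb") as idx:
        for doc in docs:
            ds.write(struct.pack(f"<{len(doc)}H", *doc))
            end += len(doc)
            idx.write(struct.pack("<Q", end))

def merge_shards(shard_paths, out_path):
    """Stage 2 sketch: concatenate token streams and rebase boundaries."""
    offset = 0  # running token offset of the current shard in the merged file
    with open(out_path + ".ds", "wb") as ds, open(out_path + ".ds.index", "wb") as idx:
        for path in shard_paths:
            with open(path + ".ds", "rb") as f:
                data = f.read()
            ds.write(data)
            with open(path + ".ds.index", "rb") as f:
                raw = f.read()
            for (end,) in struct.iter_unpack("<Q", raw):
                idx.write(struct.pack("<Q", end + offset))
            offset += len(data) // 2  # shard length in tokens (2 bytes each)

write_shard("worker_0", [[1, 2], [3]])
write_shard("worker_1", [[4, 5, 6]])
merge_shards(["worker_0", "worker_1"], "merged")
```

Rebasing the index is the key step: boundaries are stored relative to the start of each file, so each shard's end positions must be shifted by the total token count of all shards already written.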
Theoretical Basis
Subword tokenization (BPE/Unigram) maps variable-length text spans to fixed integer IDs, enabling neural language models to handle open vocabularies. Binary serialization of token sequences eliminates the need for repeated tokenization at training time. The per-token byte width (2 bytes when every token ID fits in an unsigned 16-bit integer, i.e., a maximum ID of 65535; 4 bytes otherwise) is determined automatically from the tokenizer's vocabulary size using numpy integer type bounds.
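The dtype-selection rule can be written down directly. This is a sketch of the described logic using `numpy.iinfo` bounds, not Datatrove's exact code; the function name is hypothetical.

```python
import numpy as np

def token_dtype(vocab_size: int) -> type:
    """Pick the smallest unsigned integer dtype that can hold every token ID.
    The largest ID is vocab_size - 1, so uint16 (max 65535) suffices for
    vocabularies of up to 65536 tokens; larger vocabularies need uint32."""
    if vocab_size - 1 <= np.iinfo(np.uint16).max:
        return np.uint16
    return np.uint32
```

For example, GPT-2's 50257-token vocabulary fits in 2 bytes per token, halving on-disk size relative to a naive 4-byte encoding.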