
Implementation:Huggingface Datatrove DocumentTokenizer

From Leeroopedia
Sources: Huggingface Datatrove
Domains: Tokenization, Training_Data
Last Updated: 2026-02-14

Overview

Pipeline step that tokenizes documents using HuggingFace fast tokenizers and writes binary .ds token files with optional shuffling and loss masking.

Description

DocumentTokenizer extends PipelineStepWithTokenizer and implements the full tokenization-to-disk pipeline. It batch-tokenizes document text using tokenizer.encode_batch, writes raw token bytes to an unshuffled .ds file via the TokenizedFile helper class, then optionally performs two levels of shuffling: document-level (randomizing document order) and chunk-level (grouping tokens into fixed-size chunks and shuffling those). Each shuffle step produces a new file and removes the previous one.

The TokenizedFile helper manages writing tokens as packed binary structs (H for uint16 or I for uint32), maintaining document boundary indexes, and performing the actual shuffle-copy operation by seeking to document positions in random order.
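A minimal sketch of the packed-struct layout described above, assuming little-endian byte order: tokens are packed with `"H"` (uint16) or `"I"` (uint32), and the index file records cumulative document boundaries as uint64 (`"Q"`). The `write_ds` helper and its file-naming are illustrative, not datatrove's API.

```python
import struct

def write_ds(path: str, documents: list[list[int]], token_fmt: str = "H"):
    """Write documents as a contiguous packed token array plus a boundary index.

    token_fmt: "H" for uint16 tokens, "I" for uint32 tokens.
    """
    boundary = 0
    with open(path, "wb") as ds, open(path + ".index", "wb") as idx:
        for doc in documents:
            ds.write(struct.pack(f"<{len(doc)}{token_fmt}", *doc))
            boundary += len(doc)
            idx.write(struct.pack("<Q", boundary))  # cumulative end, in tokens
```

Because each document's byte range is recoverable from the index, a shuffle pass can seek directly to document positions and copy them out in random order without re-tokenizing.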

Usage

Use as a pipeline step after all text processing is complete. Typically run in parallel with multiple ranks, each producing its own output files. Follow with DocumentTokenizerMerger to consolidate.

Code Reference

Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/tokens/tokenizer.py (L281-475)

Signature:

class DocumentTokenizer(PipelineStepWithTokenizer):
    def __init__(
        self,
        output_folder: DataFolderLike,
        tokenizer_name_or_path: str,
        local_working_dir: DataFolderLike | None = None,
        save_filename: str | None = None,
        eos_token: str | None = None,
        save_index: bool = True,
        save_loss_metadata: bool = False,
        save_final_metadata: bool = True,
        batch_size: int = 10000,
        max_tokens_per_file: int | None = None,
        seed: int | None = None,
        upload_block_size: int | None = None,
        shuffle_documents: bool = True,
        shuffle_chunk_size: int | None = None,
    ):

Import:

from datatrove.pipeline.tokens import DocumentTokenizer

I/O Contract

Inputs:

Parameter Type Required Description
output_folder DataFolderLike Yes Folder where binary token files are written
tokenizer_name_or_path str Yes HuggingFace tokenizer name or local file path
local_working_dir DataFolderLike or None No Local directory for temporary shuffle files (recommended for remote output)
save_filename str or None No Base filename for output files (default None)
eos_token str or None No EOS token appended after each document, e.g. "<|endoftext|>" (default None)
save_index bool No Save document boundary index files (default True)
save_loss_metadata bool No Save per-token loss masks (default False)
save_final_metadata bool No Save metadata file with tokenizer name and token count (default True)
batch_size int No Documents per tokenization batch (default 10000)
max_tokens_per_file int or None No Split shuffled output at this token count (default None)
seed int or None No Random seed for shuffling (default None)
upload_block_size int or None No fsspec upload block size for remote storage (default None)
shuffle_documents bool No Shuffle document order within each file (default True)
shuffle_chunk_size int or None No Token chunk size for chunk-level shuffling (default None)

Outputs:

  • Binary .ds files -- contiguous packed token arrays (uint16 or uint32)
  • .ds.index files -- uint64 document boundary positions (in tokens)
  • .ds.loss files (optional) -- per-token boolean loss masks
  • .ds.metadata files (optional) -- tokenizer name and total token count
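Given the layout above, the output files can be read back without any datatrove code. This is a sketch assuming uint16 tokens and a little-endian uint64 index; `read_ds` is an illustrative helper, not part of the library.

```python
import struct

def read_ds(path: str, token_fmt: str = "H"):
    """Read a .ds token file and its companion .ds.index boundary file."""
    width = struct.calcsize(token_fmt)
    with open(path, "rb") as f:
        data = f.read()
    tokens = list(struct.unpack(f"<{len(data) // width}{token_fmt}", data))
    with open(path + ".index", "rb") as f:
        raw = f.read()
    boundaries = list(struct.unpack(f"<{len(raw) // 8}Q", raw))  # uint64 positions
    return tokens, boundaries
```

In practice the shuffled .ds files are memory-mapped by training dataloaders rather than read whole, but the byte layout is the same.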

Usage Examples

Example 1 -- Basic tokenization with GPT-2:

from datatrove.pipeline.tokens import DocumentTokenizer

tokenizer = DocumentTokenizer(
    output_folder="/data/tokens/",
    tokenizer_name_or_path="gpt2",
    eos_token="<|endoftext|>",
)

Example 2 -- Tokenization with chunk shuffling for large datasets:

from datatrove.pipeline.tokens import DocumentTokenizer

tokenizer = DocumentTokenizer(
    output_folder="s3://my-bucket/tokens/",
    tokenizer_name_or_path="gpt2",
    local_working_dir="/tmp/tokenizer_work/",
    shuffle_documents=True,
    shuffle_chunk_size=2048 * 8192,
    max_tokens_per_file=1024 * 1024 * 1024,
    seed=42,
)
