
Implementation:Huggingface Datatrove DocumentTokenizer

From Leeroopedia
Sources: Huggingface Datatrove
Domains: Tokenization, Training_Data
Last Updated: 2026-02-14

Overview

Pipeline step that tokenizes documents using HuggingFace fast tokenizers and writes binary .ds token files with optional shuffling and loss masking.

Description

DocumentTokenizer extends PipelineStepWithTokenizer and implements the full tokenization-to-disk pipeline. It batch-tokenizes document text using tokenizer.encode_batch, writes raw token bytes to an unshuffled .ds file via the TokenizedFile helper class, then optionally performs two levels of shuffling: document-level (randomizing document order) and chunk-level (grouping tokens into fixed-size chunks and shuffling those). Each shuffle step produces a new file and removes the previous one.

The TokenizedFile helper manages writing tokens as packed binary structs (H for uint16 or I for uint32), maintaining document boundary indexes, and performing the actual shuffle-copy operation by seeking to document positions in random order.
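A minimal sketch of the packed-struct layout described above, assuming little-endian byte order: tokens are packed with `"H"` (uint16) or `"I"` (uint32), and the index file records cumulative document boundaries as uint64 (`"Q"`). The `write_ds` helper and its file-naming are illustrative, not datatrove's API.

```python
import struct

def write_ds(path: str, documents: list[list[int]], token_fmt: str = "H"):
    """Write documents as a contiguous packed token array plus a boundary index.

    token_fmt: "H" for uint16 tokens, "I" for uint32 tokens.
    """
    boundary = 0
    with open(path, "wb") as ds, open(path + ".index", "wb") as idx:
        for doc in documents:
            ds.write(struct.pack(f"<{len(doc)}{token_fmt}", *doc))
            boundary += len(doc)
            idx.write(struct.pack("<Q", boundary))  # cumulative end, in tokens
```

Because each document's byte range is recoverable from the index, a shuffle pass can seek directly to document positions and copy them out in random order without re-tokenizing.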

Usage

Use as a pipeline step after all text processing is complete. Typically run in parallel with multiple ranks, each producing its own output files. Follow with DocumentTokenizerMerger to consolidate.

Code Reference

Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/tokens/tokenizer.py (L281-475)

Signature:

class DocumentTokenizer(PipelineStepWithTokenizer):
    def __init__(
        self,
        output_folder: DataFolderLike,
        tokenizer_name_or_path: str,
        local_working_dir: DataFolderLike | None = None,
        save_filename: str | None = None,
        eos_token: str | None = None,
        save_index: bool = True,
        save_loss_metadata: bool = False,
        save_final_metadata: bool = True,
        batch_size: int = 10000,
        max_tokens_per_file: int | None = None,
        seed: int | None = None,
        upload_block_size: int | None = None,
        shuffle_documents: bool = True,
        shuffle_chunk_size: int | None = None,
    ):

Import:

from datatrove.pipeline.tokens import DocumentTokenizer

I/O Contract

Inputs:

Parameter Type Required Description
output_folder DataFolderLike Yes Folder where binary token files are written
tokenizer_name_or_path str Yes HuggingFace tokenizer name or local file path
local_working_dir DataFolderLike or None No Local directory for temporary shuffle files (recommended for remote output)
save_filename str or None No Base filename for output files (default None)
eos_token str or None No EOS token appended after each document, e.g. "<|endoftext|>" (default None)
save_index bool No Save document boundary index files (default True)
save_loss_metadata bool No Save per-token loss masks (default False)
save_final_metadata bool No Save metadata file with tokenizer name and token count (default True)
batch_size int No Documents per tokenization batch (default 10000)
max_tokens_per_file int or None No Split shuffled output at this token count (default None)
seed int or None No Random seed for shuffling (default None)
upload_block_size int or None No fsspec upload block size for remote storage (default None)
shuffle_documents bool No Shuffle document order within each file (default True)
shuffle_chunk_size int or None No Token chunk size for chunk-level shuffling (default None)

Outputs:

  • Binary .ds files -- contiguous packed token arrays (uint16 or uint32)
  • .ds.index files -- uint64 document boundary positions (in tokens)
  • .ds.loss files (optional) -- per-token boolean loss masks
  • .ds.metadata files (optional) -- tokenizer name and total token count
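Given the layout above, the output files can be read back without any datatrove code. This is a sketch assuming uint16 tokens and a little-endian uint64 index; `read_ds` is an illustrative helper, not part of the library.

```python
import struct

def read_ds(path: str, token_fmt: str = "H"):
    """Read a .ds token file and its companion .ds.index boundary file."""
    width = struct.calcsize(token_fmt)
    with open(path, "rb") as f:
        data = f.read()
    tokens = list(struct.unpack(f"<{len(data) // width}{token_fmt}", data))
    with open(path + ".index", "rb") as f:
        raw = f.read()
    boundaries = list(struct.unpack(f"<{len(raw) // 8}Q", raw))  # uint64 positions
    return tokens, boundaries
```

In practice the shuffled .ds files are memory-mapped by training dataloaders rather than read whole, but the byte layout is the same.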

Usage Examples

Example 1 -- Basic tokenization with GPT-2:

from datatrove.pipeline.tokens import DocumentTokenizer

tokenizer = DocumentTokenizer(
    output_folder="/data/tokens/",
    tokenizer_name_or_path="gpt2",
    eos_token="<|endoftext|>",
)

Example 2 -- Tokenization with chunk shuffling for large datasets:

from datatrove.pipeline.tokens import DocumentTokenizer

tokenizer = DocumentTokenizer(
    output_folder="s3://my-bucket/tokens/",
    tokenizer_name_or_path="gpt2",
    local_working_dir="/tmp/tokenizer_work/",
    shuffle_documents=True,
    shuffle_chunk_size=2048 * 8192,
    max_tokens_per_file=1024 * 1024 * 1024,
    seed=42,
)
