Implementation: Huggingface Datatrove DocumentTokenizer
| Sources | Domains | Last Updated |
|---|---|---|
| Huggingface Datatrove | Tokenization, Training_Data | 2026-02-14 |
Overview
Pipeline step that tokenizes documents using HuggingFace fast tokenizers and writes binary .ds token files with optional shuffling and loss masking.
Description
DocumentTokenizer extends PipelineStepWithTokenizer and implements the full tokenization-to-disk pipeline. It batch-tokenizes document text using tokenizer.encode_batch, writes raw token bytes to an unshuffled .ds file via the TokenizedFile helper class, then optionally performs two levels of shuffling: document-level (randomizing document order) and chunk-level (grouping tokens into fixed-size chunks and shuffling those). Each shuffle step produces a new file and removes the previous one.
The TokenizedFile helper manages writing tokens as packed binary structs (H for uint16 or I for uint32), maintaining document boundary indexes, and performing the actual shuffle-copy operation by seeking to document positions in random order.
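The write-then-shuffle mechanics can be sketched with stdlib tools. This is a hypothetical illustration of the technique, not datatrove's actual TokenizedFile code: tokens are packed little-endian as uint16 ("H") when the vocabulary fits, uint32 ("I") otherwise; the index records the cumulative token count at each document boundary; and the shuffle pass re-emits documents by seeking to each document's byte range in random order.

```python
import io
import random
import struct

def write_documents(buf: io.BytesIO, docs: list[list[int]], vocab_size: int) -> list[int]:
    """Pack each document's tokens into buf; return document boundary index.

    Sketch of the unshuffled .ds write: token width depends on vocab size,
    boundaries are cumulative token counts (what the .ds.index stores).
    """
    fmt = "H" if vocab_size < 2**16 else "I"
    boundaries, total = [], 0
    for doc in docs:
        buf.write(struct.pack(f"<{len(doc)}{fmt}", *doc))
        total += len(doc)
        boundaries.append(total)
    return boundaries

def shuffle_copy(data: bytes, boundaries: list[int], token_size: int, seed: int) -> bytes:
    """Re-emit documents in random order (the document-level shuffle pass)."""
    starts = [0] + boundaries[:-1]
    order = list(range(len(starts)))
    random.Random(seed).shuffle(order)
    out = io.BytesIO()
    for i in order:
        # seek to this document's byte range and copy it out whole
        out.write(data[starts[i] * token_size : boundaries[i] * token_size])
    return out.getvalue()
```

Chunk-level shuffling works the same way, except the byte ranges are fixed-size token chunks rather than whole documents.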
Usage
Use as a pipeline step after all text processing is complete. Typically run in parallel with multiple ranks, each producing its own output files. Follow with DocumentTokenizerMerger to consolidate.
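A typical two-stage wiring might look like the sketch below. This is a configuration sketch only: `LocalPipelineExecutor`, `JsonlReader`, and the `DocumentTokenizerMerger` parameter names are taken from datatrove's public API but have not been verified against this exact version, and the paths are placeholders.

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import DocumentTokenizer, DocumentTokenizerMerger

# Stage 1: each of 8 tasks tokenizes its shard into its own .ds files.
tokenize = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("/data/filtered/"),  # placeholder input location
        DocumentTokenizer(
            output_folder="/data/tokens/",
            tokenizer_name_or_path="gpt2",
        ),
    ],
    tasks=8,
)

# Stage 2: a single task consolidates the per-rank files.
merge = LocalPipelineExecutor(
    pipeline=[
        DocumentTokenizerMerger(
            input_folder="/data/tokens/",
            output_folder="/data/tokens_merged/",
            save_filename="dataset",
        ),
    ],
    tasks=1,
)

tokenize.run()
merge.run()
```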
Code Reference
Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/tokens/tokenizer.py (L281-475)
Signature:
class DocumentTokenizer(PipelineStepWithTokenizer):
    def __init__(
        self,
        output_folder: DataFolderLike,
        tokenizer_name_or_path: str,
        local_working_dir: DataFolderLike | None = None,
        save_filename: str | None = None,
        eos_token: str | None = None,
        save_index: bool = True,
        save_loss_metadata: bool = False,
        save_final_metadata: bool = True,
        batch_size: int = 10000,
        max_tokens_per_file: int | None = None,
        seed: int | None = None,
        upload_block_size: int | None = None,
        shuffle_documents: bool = True,
        shuffle_chunk_size: int | None = None,
    ):
Import:
from datatrove.pipeline.tokens import DocumentTokenizer
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Folder where binary token files are written |
| tokenizer_name_or_path | str | Yes | HuggingFace tokenizer name or local file path |
| local_working_dir | DataFolderLike or None | No | Local directory for temporary shuffle files (recommended for remote output) |
| save_filename | str or None | No | Base filename for output files (default None) |
| eos_token | str or None | No | EOS token appended after each document, e.g. "<\|endoftext\|>" (default None) |
| save_index | bool | No | Save document boundary index files (default True) |
| save_loss_metadata | bool | No | Save per-token loss masks (default False) |
| save_final_metadata | bool | No | Save metadata file with tokenizer name and token count (default True) |
| batch_size | int | No | Documents per tokenization batch (default 10000) |
| max_tokens_per_file | int or None | No | Split shuffled output at this token count (default None) |
| seed | int or None | No | Random seed for shuffling (default None) |
| upload_block_size | int or None | No | fsspec upload block size for remote storage (default None) |
| shuffle_documents | bool | No | Shuffle document order within each file (default True) |
| shuffle_chunk_size | int or None | No | Token chunk size for chunk-level shuffling (default None) |
Outputs:
- Binary .ds files -- contiguous packed token arrays (uint16 or uint32)
- .ds.index files -- uint64 document boundary positions (in tokens)
- .ds.loss files (optional) -- per-token boolean loss masks
- .ds.metadata files (optional) -- tokenizer name and total token count
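Given the layout above, the output can be decoded with a short stdlib reader. This is a hypothetical sketch (not part of datatrove) assuming the description above: the .ds file holds contiguous little-endian tokens, and the .ds.index file holds uint64 cumulative document boundary positions measured in tokens.

```python
import struct

def read_ds(ds_bytes: bytes, index_bytes: bytes, token_size: int = 2) -> list[list[int]]:
    """Split a .ds token stream back into documents using its .ds.index."""
    fmt = "H" if token_size == 2 else "I"  # uint16 vs uint32 tokens
    n_docs = len(index_bytes) // 8  # one uint64 boundary per document
    boundaries = struct.unpack(f"<{n_docs}Q", index_bytes)
    docs, start = [], 0
    for end in boundaries:
        chunk = ds_bytes[start * token_size : end * token_size]
        docs.append(list(struct.unpack(f"<{end - start}{fmt}", chunk)))
        start = end
    return docs
```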
Usage Examples
Example 1 -- Basic tokenization with GPT-2:
from datatrove.pipeline.tokens import DocumentTokenizer
tokenizer = DocumentTokenizer(
output_folder="/data/tokens/",
tokenizer_name_or_path="gpt2",
eos_token="<|endoftext|>",
)
Example 2 -- Tokenization with chunk shuffling for large datasets:
from datatrove.pipeline.tokens import DocumentTokenizer
tokenizer = DocumentTokenizer(
output_folder="s3://my-bucket/tokens/",
tokenizer_name_or_path="gpt2",
local_working_dir="/tmp/tokenizer_work/",
shuffle_documents=True,
shuffle_chunk_size=2048 * 8192,
max_tokens_per_file=1024 * 1024 * 1024,
seed=42,
)