Implementation:Huggingface Datatrove TokenizerUtils

Knowledge Sources	Huggingface_Datatrove
Domains	NLP, Tokenization, Data Processing
Last Updated	2026-02-14 17:00 GMT

Overview

Provides a tokenizer loading utility function and a base class for pipeline steps that require a HuggingFace Tokenizer instance.

Description

The tokenization.py module contains two primary components. The load_tokenizer function provides a unified way to load a HuggingFace Tokenizer from either a local file path or a pretrained model name on the Hugging Face Hub. It checks whether the provided string is a local file path and dispatches to Tokenizer.from_file or Tokenizer.from_pretrained accordingly.
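The dispatch described above can be illustrated with a small standalone sketch. It is not the module's code: load_with_dispatch and the two stub loaders are hypothetical names, the loaders are injected so the example runs without the tokenizers package, and the use of os.path.isfile as the "local file" test is an assumption.

```python
import os
import tempfile
from typing import Any, Callable


def load_with_dispatch(
    name_or_path: str,
    from_file: Callable[[str], Any],
    from_pretrained: Callable[[str], Any],
) -> Any:
    """Hypothetical helper: choose a loader depending on whether the string is a local file."""
    # Assumption: the local-path check is an os.path.isfile test.
    if os.path.isfile(name_or_path):
        return from_file(name_or_path)
    return from_pretrained(name_or_path)


# Stub loaders that only record which branch was taken.
def fake_from_file(path: str) -> tuple:
    return ("file", path)


def fake_from_pretrained(name: str) -> tuple:
    return ("hub", name)


# An existing temporary file is routed to the file loader.
with tempfile.NamedTemporaryFile(suffix=".json") as tmp:
    local_result = load_with_dispatch(tmp.name, fake_from_file, fake_from_pretrained)

# A bare model name that is not a file on disk is routed to the Hub loader.
hub_result = load_with_dispatch("gpt2", fake_from_file, fake_from_pretrained)
```

In the real function the two branches would correspond to Tokenizer.from_file and Tokenizer.from_pretrained.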

The PipelineStepWithTokenizer abstract base class extends PipelineStep to add tokenizer management capabilities. It accepts a tokenizer_name_or_path, an optional eos_token, and an optional post_processor (a TemplateProcessing instance). The tokenizer is loaded lazily via a cached_property and optionally configured with a custom post-processor or an EOS token template. If an eos_token is provided without a custom post-processor, the class automatically creates a TemplateProcessing that appends the EOS token to every encoded sequence.
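The lazy loading and the precedence between post_processor and eos_token can be sketched without the tokenizers dependency. Everything below is illustrative: LazyTokenizerHolder and stub_loader are hypothetical, the "tokenizer" is a plain dict, and the template string merely stands in for a real TemplateProcessing object.

```python
from functools import cached_property
from typing import Any, Callable, Optional


class LazyTokenizerHolder:
    """Hypothetical stand-in for the lazy-loading pattern; not datatrove's class."""

    def __init__(
        self,
        name_or_path: str,
        loader: Callable[[str], dict],
        eos_token: Optional[str] = None,
        post_processor: Optional[Any] = None,
    ):
        self.name_or_path = name_or_path
        self.eos_token = eos_token
        self._post_processor = post_processor
        self._loader = loader  # injected so the sketch needs no external package

    @cached_property
    def tokenizer(self) -> dict:
        # Runs on first access only; cached_property stores the result afterwards.
        tok = self._loader(self.name_or_path)
        if self._post_processor is not None:
            # A custom post-processor takes precedence over eos_token.
            tok["post_processor"] = self._post_processor
        elif self.eos_token is not None:
            # Stand-in for a template that appends the EOS token to each sequence.
            tok["post_processor"] = f"$A {self.eos_token}"
        return tok


load_calls = []


def stub_loader(name: str) -> dict:
    load_calls.append(name)
    return {"name": name, "post_processor": None}


holder = LazyTokenizerHolder("gpt2", stub_loader, eos_token="<|endoftext|>")
calls_before_access = len(load_calls)  # nothing is loaded at construction time
first = holder.tokenizer
second = holder.tokenizer              # served from the cache, no second load
```

Deferring the load this way keeps construction cheap, so the tokenizer is only materialized when the property is first read.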

Additionally, the module includes a chunk_doc_ends utility function that partitions a list of document end positions into fixed-size chunks. It is used during token shuffling, where documents need to be grouped into chunks of a specified size.

PipelineStepWithTokenizer also provides token_size and token_format cached properties, which determine the byte width of a stored token (2 or 4 bytes) and the matching struct format character ("H" or "I") from the tokenizer's vocabulary size.
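One plausible reading of the chunk_doc_ends partitioning is sketched below. This is an interpretation, not the module's code: chunk_doc_ends_sketch is a hypothetical name, and the sketch assumes that doc_ends is sorted, that each chunk spans shuffle_chunk_size tokens, and that document ends beyond the last full chunk are dropped.

```python
def chunk_doc_ends_sketch(doc_ends: list[int], shuffle_chunk_size: int) -> list[list[int]]:
    """Group sorted document end positions by the fixed-size token chunk they fall in."""
    if not doc_ends:
        return []
    all_chunks: list[list[int]] = []
    i = 0
    # Walk over the end position of every full chunk: size, 2*size, 3*size, ...
    for chunk_end in range(shuffle_chunk_size, doc_ends[-1] + 1, shuffle_chunk_size):
        current: list[int] = []
        # Collect every document that ends at or before this chunk boundary.
        while i < len(doc_ends) and doc_ends[i] <= chunk_end:
            current.append(doc_ends[i])
            i += 1
        all_chunks.append(current)
    # Assumption: ends past the last full chunk (a trailing partial chunk) are dropped.
    return all_chunks


chunks = chunk_doc_ends_sketch([3, 5, 8, 10, 14], 5)
# chunks == [[3, 5], [8, 10]]; the document ending at 14 lies in an incomplete third chunk
```

Whether the real function keeps or drops a trailing partial chunk should be checked against the source file.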

Usage

Use load_tokenizer when you need to load a HuggingFace Tokenizer from either a local file or a pretrained name in a single call. Subclass PipelineStepWithTokenizer when building pipeline steps that require tokenization, such as document tokenizers, token counters, or any processing step that converts text to token IDs.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/utils/tokenization.py
Lines: 1-99

Signature

def load_tokenizer(name_or_path: str) -> "Tokenizer": ...

def chunk_doc_ends(doc_ends, shuffle_chunk_size) -> list[list[int]]: ...

class PipelineStepWithTokenizer(PipelineStep, ABC):
    _requires_dependencies = ["tokenizers"]

    def __init__(
        self,
        tokenizer_name_or_path: str | None = None,
        eos_token: str | None = None,
        post_processor: Optional[TemplateProcessing] = None,
    ): ...

    @cached_property
    def token_size(self) -> int: ...

    @cached_property
    def token_format(self) -> str: ...

    @cached_property
    def tokenizer(self) -> "Tokenizer": ...

Import

from datatrove.utils.tokenization import load_tokenizer, PipelineStepWithTokenizer, chunk_doc_ends

I/O Contract

Inputs

Name	Type	Required	Description
name_or_path	str	Yes	A local file path or Hugging Face Hub model name for the tokenizer (load_tokenizer)
tokenizer_name_or_path	str or None	No	Path or name of the tokenizer to load (PipelineStepWithTokenizer)
eos_token	str or None	No	End-of-sequence token string to append via post-processing
post_processor	TemplateProcessing or None	No	Custom post-processor to apply to the tokenizer
doc_ends	list[int]	Yes	List of document end positions in token space (chunk_doc_ends)
shuffle_chunk_size	int	Yes	Size of each chunk for partitioning document ends (chunk_doc_ends)

Outputs

Name	Type	Description
tokenizer	Tokenizer	A loaded and optionally configured HuggingFace Tokenizer instance
token_size	int	Bytes per stored token ID: 4 if the vocabulary is too large for uint16 IDs, otherwise 2
token_format	str	Struct format character: "I" for 4-byte tokens, "H" for 2-byte tokens
all_chunks_doc_ends	list[list[int]]	Partitioned document end positions grouped into fixed-size chunks
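The relationship between vocabulary size, token_size, and token_format can be shown with the standard struct module. The functions below are hypothetical standalone versions of the two properties. The exact threshold (here, the largest token ID must fit in a uint16) and the little-endian byte order are assumptions.

```python
import struct

UINT16_MAX = 65535  # largest value an unsigned 16-bit integer can hold


def token_size(vocab_size: int) -> int:
    """Bytes per token ID. Assumes IDs run from 0 to vocab_size - 1."""
    return 4 if vocab_size - 1 > UINT16_MAX else 2


def token_format(vocab_size: int) -> str:
    """struct format character: "I" (uint32) or "H" (uint16)."""
    return "I" if token_size(vocab_size) == 4 else "H"


def pack_tokens(ids: list[int], vocab_size: int) -> bytes:
    """Serialize token IDs with the chosen width (little-endian is an assumption)."""
    return struct.pack(f"<{len(ids)}{token_format(vocab_size)}", *ids)


small = pack_tokens([1, 2, 3], vocab_size=50_257)   # 2 bytes per token
large = pack_tokens([1, 2, 3], vocab_size=128_256)  # 4 bytes per token
```

Using the narrower format halves the on-disk size of tokenized data whenever every token ID fits in 16 bits.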

Usage Examples

Loading a Tokenizer

from datatrove.utils.tokenization import load_tokenizer

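# Load from the Hugging Face Hub by model name
tokenizer = load_tokenizer("gpt2")

# Alternatively, load from a local tokenizer file (placeholder path; replace with a real one)
# tokenizer = load_tokenizer("/path/to/tokenizer.json")

encoded = tokenizer.encode("Hello, world!")
print(encoded.ids)  # list of integer token IDs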

Subclassing PipelineStepWithTokenizer

from datatrove.utils.tokenization import PipelineStepWithTokenizer
from datatrove.data import DocumentsPipeline

class MyTokenStep(PipelineStepWithTokenizer):
    def __init__(self, tokenizer_name: str):
        super().__init__(tokenizer_name_or_path=tokenizer_name, eos_token="<|endoftext|>")

    def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1):
        for doc in data:
            tokens = self.tokenizer.encode(doc.text)
            # process tokens ...
            yield doc

Related Pages

Principle:Huggingface_Datatrove_Tokenizer_Loading
