Implementation:Huggingface Datatrove TokenizerUtils

From Leeroopedia
Revision as of 13:02, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Datatrove_TokenizerUtils.md)
Knowledge Sources
Domains: NLP, Tokenization, Data Processing
Last Updated: 2026-02-14 17:00 GMT

Overview

Provides a tokenizer loading utility function and a base class for pipeline steps that require a HuggingFace Tokenizer instance.

Description

The tokenization.py module contains two primary components. The load_tokenizer function provides a unified way to load a HuggingFace Tokenizer from either a local file path or a pretrained model name on the Hugging Face Hub. It checks whether the provided string is a local file path and dispatches to Tokenizer.from_file or Tokenizer.from_pretrained accordingly.
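The dispatch decision can be sketched without loading a real tokenizer. The helper below is hypothetical (choose_loader is not part of datatrove) and assumes the local-path check is done with os.path.isfile; the library may use a different test. It only reports which constructor would be chosen.

```python
import os
import tempfile


def choose_loader(name_or_path: str) -> str:
    """Return the name of the Tokenizer constructor the dispatch would pick.

    Hypothetical helper for illustration only: the real load_tokenizer
    returns a Tokenizer instance, not a method name.
    """
    # Assumption: "is a local file path" is tested with os.path.isfile.
    if os.path.isfile(name_or_path):
        return "from_file"
    return "from_pretrained"


# Create an empty temporary file to stand in for a local tokenizer.json.
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as handle:
    local_path = handle.name

file_choice = choose_loader(local_path)   # existing file: local loading
hub_choice = choose_loader("gpt2")        # no such file: treated as a Hub name
os.remove(local_path)
removed_choice = choose_loader(local_path)  # file is gone: falls back to the Hub
```

One consequence of this kind of check is that a mistyped local path is silently treated as a Hub model name, so a loading error may mention the Hub rather than the missing file.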

The PipelineStepWithTokenizer abstract base class extends PipelineStep to add tokenizer management capabilities. It accepts a tokenizer_name_or_path, an optional eos_token, and an optional post_processor (a TemplateProcessing instance). The tokenizer is loaded lazily via a cached_property and optionally configured with a custom post-processor or an EOS token template. If an eos_token is provided without a custom post-processor, the class automatically creates a TemplateProcessing that appends the EOS token to every encoded sequence.
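The lazy-loading pattern can be illustrated with the standard library alone. In the sketch below, StandInTokenizer and LazyTokenizerHolder are hypothetical names, and the whitespace tokenizer is only a stand-in for a real HuggingFace Tokenizer; appending eos_id mimics what an EOS template would do, under those assumptions.

```python
from functools import cached_property
from typing import Optional


class StandInTokenizer:
    """Hypothetical whitespace tokenizer standing in for tokenizers.Tokenizer."""

    def __init__(self, vocab: dict[str, int]):
        self.vocab = vocab
        self.eos_id: Optional[int] = None  # set when an EOS token is configured

    def encode_ids(self, text: str) -> list[int]:
        ids = [self.vocab[word] for word in text.split()]
        if self.eos_id is not None:
            ids.append(self.eos_id)  # mimics appending EOS to every sequence
        return ids


class LazyTokenizerHolder:
    """Hypothetical class mirroring the lazy-loading pattern described above."""

    def __init__(self, vocab: dict[str, int], eos_token: Optional[str] = None):
        self.vocab = vocab
        self.eos_token = eos_token
        self.load_count = 0  # counts how often the tokenizer is built

    @cached_property
    def tokenizer(self) -> StandInTokenizer:
        # Runs on first access only; the result is then cached on the instance.
        self.load_count += 1
        tok = StandInTokenizer(self.vocab)
        if self.eos_token is not None:
            tok.eos_id = self.vocab[self.eos_token]
        return tok


holder = LazyTokenizerHolder({"hello": 0, "world": 1, "<eos>": 2}, eos_token="<eos>")
count_before = holder.load_count                   # nothing built yet
ids = holder.tokenizer.encode_ids("hello world")   # first access builds the tokenizer
ids_again = holder.tokenizer.encode_ids("world")   # second access reuses the cache
count_after = holder.load_count
```

Deferring construction to first access means that creating the pipeline step stays cheap, and the tokenizer is only built in the process that actually uses it.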

Additionally, the module includes a chunk_doc_ends utility function that partitions a list of document end positions into fixed-size chunks. It is used during token shuffling, where documents need to be grouped into chunks of a specified size. Separately, PipelineStepWithTokenizer provides token_size and token_format cached properties, which determine the byte width per token (2 or 4 bytes) and the matching struct format character ("H" or "I") from the tokenizer's vocabulary size.
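One way to picture the partitioning done by chunk_doc_ends is the sketch below. It is an illustration under assumptions, not the library's code: the function name chunk_doc_ends_sketch is hypothetical, each document end is assigned to the first fixed-size window whose right edge is at or beyond it, and ends that fall past the last full window are dropped. The real function may treat boundaries and the trailing remainder differently.

```python
def chunk_doc_ends_sketch(doc_ends: list[int], shuffle_chunk_size: int) -> list[list[int]]:
    """Group sorted document end positions by fixed-size token window.

    Hypothetical sketch; doc_ends is assumed to be sorted in ascending order.
    """
    if not doc_ends:
        return []
    all_chunks_doc_ends = []
    i = 0
    # Walk over the right edge of every full window: size, 2 * size, ...
    for chunk_end in range(shuffle_chunk_size, doc_ends[-1] + 1, shuffle_chunk_size):
        current = []
        # Collect every document that ends inside the current window.
        while i < len(doc_ends) and doc_ends[i] <= chunk_end:
            current.append(doc_ends[i])
            i += 1
        all_chunks_doc_ends.append(current)
    return all_chunks_doc_ends


even = chunk_doc_ends_sketch([3, 5, 8, 12], shuffle_chunk_size=4)
ragged = chunk_doc_ends_sketch([3, 5, 10], shuffle_chunk_size=4)
```

With a chunk size of 4, the first call yields [[3], [5, 8], [12]]. In the second call the document ending at position 10 lies beyond the last full window (which ends at 8), so this sketch leaves it out and returns [[3], [5]].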

Usage

Use load_tokenizer when you need to load a HuggingFace Tokenizer from either a local file or a pretrained name in a single call. Subclass PipelineStepWithTokenizer when building pipeline steps that require tokenization, such as document tokenizers, token counters, or any processing step that converts text to token IDs.

Code Reference

Source Location

Signature

def load_tokenizer(name_or_path: str) -> "Tokenizer": ...

def chunk_doc_ends(doc_ends, shuffle_chunk_size) -> list[list[int]]: ...

class PipelineStepWithTokenizer(PipelineStep, ABC):
    _requires_dependencies = ["tokenizers"]

    def __init__(
        self,
        tokenizer_name_or_path: str | None = None,
        eos_token: str | None = None,
        post_processor: Optional[TemplateProcessing] = None,
    ): ...

    @cached_property
    def token_size(self) -> int: ...

    @cached_property
    def token_format(self) -> str: ...

    @cached_property
    def tokenizer(self) -> "Tokenizer": ...

Import

from datatrove.utils.tokenization import load_tokenizer, PipelineStepWithTokenizer, chunk_doc_ends

I/O Contract

Inputs

Name | Type | Required | Description
name_or_path | str | Yes | A local file path or Hugging Face Hub model name for the tokenizer (load_tokenizer)
tokenizer_name_or_path | str or None | No | Path or name of the tokenizer to load (PipelineStepWithTokenizer)
eos_token | str or None | No | End-of-sequence token string to append via post-processing
post_processor | TemplateProcessing or None | No | Custom post-processor to apply to the tokenizer
doc_ends | list[int] | Yes | List of document end positions in token space (chunk_doc_ends)
shuffle_chunk_size | int | Yes | Size of each chunk for partitioning document ends (chunk_doc_ends)

Outputs

Name | Type | Description
tokenizer | Tokenizer | A loaded and optionally configured HuggingFace Tokenizer instance
token_size | int | Byte width per token: 4 if vocabulary exceeds uint16 max, otherwise 2
token_format | str | Struct format character: "I" for 4-byte tokens, "H" for 2-byte tokens
all_chunks_doc_ends | list[list[int]] | Partitioned document end positions grouped into fixed-size chunks
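The relationship between vocabulary size, token_size, and token_format can be sketched with the standard struct module. The free functions below are hypothetical stand-ins for the cached properties, the little-endian byte order is an assumption, and the exact boundary (whether a vocabulary of exactly 65,536 entries counts as 2 or 4 bytes) is also an assumption; the example vocabulary sizes are chosen far from that boundary.

```python
import struct

UINT16_MAX = 2**16 - 1  # 65535, the largest ID a 2-byte unsigned int can hold


def token_size(vocab_size: int) -> int:
    # IDs run from 0 to vocab_size - 1, so 2 bytes suffice up to 65536 entries.
    # Assumption: the library's exact boundary may differ by one.
    return 4 if vocab_size > UINT16_MAX + 1 else 2


def token_format(vocab_size: int) -> str:
    # "H" is an unsigned 2-byte integer, "I" an unsigned 4-byte integer.
    return "I" if token_size(vocab_size) == 4 else "H"


def pack_tokens(ids: list[int], vocab_size: int) -> bytes:
    # "<" selects little-endian order and standard sizes (assumed here).
    return struct.pack(f"<{len(ids)}{token_format(vocab_size)}", *ids)


small = pack_tokens([15496, 11, 995], vocab_size=50_257)       # 2 bytes per token
large = pack_tokens([15496, 11, 100_000], vocab_size=128_256)  # 4 bytes per token
round_trip = struct.unpack("<3H", small)
```

Choosing the narrower format halves the on-disk size of tokenized data whenever the vocabulary fits in 16 bits, which is why the width is derived from the vocabulary size rather than fixed.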

Usage Examples

Loading a Tokenizer

from datatrove.utils.tokenization import load_tokenizer

# Load from Hugging Face Hub
tokenizer = load_tokenizer("gpt2")

# Load from a local file
tokenizer = load_tokenizer("/path/to/tokenizer.json")

encoded = tokenizer.encode("Hello, world!")
print(encoded.ids)

Subclassing PipelineStepWithTokenizer

from datatrove.utils.tokenization import PipelineStepWithTokenizer
from datatrove.data import DocumentsPipeline

class MyTokenStep(PipelineStepWithTokenizer):
    def __init__(self, tokenizer_name: str):
        super().__init__(tokenizer_name_or_path=tokenizer_name, eos_token="<|endoftext|>")

    def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1):
        for doc in data:
            # encode() returns an Encoding object; the token IDs are in .ids
            encoding = self.tokenizer.encode(doc.text)
            token_ids = encoding.ids
            # process token_ids ...
            yield doc
