Implementation:Huggingface Datatrove TokenizerUtils
| Knowledge Sources | |
|---|---|
| Domains | NLP, Tokenization, Data Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Provides a tokenizer loading utility function and a base class for pipeline steps that require a HuggingFace Tokenizer instance.
Description
The tokenization.py module contains two primary components. The load_tokenizer function provides a unified way to load a HuggingFace Tokenizer from either a local file path or a pretrained model name on the Hugging Face Hub. It checks whether the provided string is a local file path and dispatches to Tokenizer.from_file or Tokenizer.from_pretrained accordingly.
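The dispatch described above can be sketched without the tokenizers dependency. The helper below is hypothetical (`pick_loader` is not part of datatrove) and only reports which `Tokenizer` constructor would be chosen. It assumes the local check is an `os.path.isfile` test, which may differ from the check the module actually performs.

```python
import os
import tempfile


def pick_loader(name_or_path: str) -> str:
    """Return the name of the Tokenizer constructor that would be used.

    Hypothetical helper for illustration; the file check is an assumption.
    """
    if os.path.isfile(name_or_path):
        return "from_file"  # a tokenizer file that exists on disk
    return "from_pretrained"  # otherwise treat the string as a Hub model name


# A real file on disk is routed to the local loader.
with tempfile.NamedTemporaryFile(suffix=".json") as tmp:
    local_choice = pick_loader(tmp.name)

# A string that is not an existing file falls through to the Hub loader.
hub_choice = pick_loader("some-org/hypothetical-tokenizer")

print(local_choice, hub_choice)
```

Under this path-first ordering, a local file whose name happens to match a Hub model name would take precedence over the Hub copy.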
The PipelineStepWithTokenizer abstract base class extends PipelineStep to add tokenizer management capabilities. It accepts a tokenizer_name_or_path, an optional eos_token, and an optional post_processor (a TemplateProcessing instance). The tokenizer is loaded lazily via a cached_property and optionally configured with a custom post-processor or an EOS token template. If an eos_token is provided without a custom post-processor, the class automatically creates a TemplateProcessing that appends the EOS token to every encoded sequence.
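The precedence between the two options can be illustrated with plain lists of token IDs. This is a sketch, not the class's code: `apply_post_processing` and the ID values are hypothetical, and appending an ID stands in for what an EOS `TemplateProcessing` template does to a single encoded sequence.

```python
from typing import Callable, Optional


def apply_post_processing(
    ids: list[int],
    eos_id: Optional[int] = None,
    custom: Optional[Callable[[list[int]], list[int]]] = None,
) -> list[int]:
    """Mimic the precedence: custom post-processor, then EOS, then no change."""
    if custom is not None:
        return custom(ids)  # a custom post-processor wins; eos_id is ignored
    if eos_id is not None:
        return ids + [eos_id]  # EOS appended to the encoded sequence
    return list(ids)  # neither option set: the sequence is left as-is


ids = [15496, 11, 995]  # made-up token IDs
EOS_ID = 50256  # made-up EOS ID

plain = apply_post_processing(ids)
with_eos = apply_post_processing(ids, eos_id=EOS_ID)
with_custom = apply_post_processing(ids, eos_id=EOS_ID, custom=lambda x: [0] + x)

print(plain, with_eos, with_custom)
```

The third call shows why passing both options is usually redundant: once a custom post-processor is supplied, the EOS setting has no effect in this model of the behavior.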
Additionally, the module includes a chunk_doc_ends utility function that partitions a list of document end positions into fixed-size chunks. It is used during token shuffling, where documents need to be grouped into chunks of a specified size. Separately, PipelineStepWithTokenizer provides token_size and token_format cached properties, which derive the byte width per token (2 or 4 bytes) and the matching struct format character ("H" or "I") from the tokenizer's vocabulary size.
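To make the byte-width rule concrete, here is a minimal sketch using only the standard library. The function names are hypothetical, and two details are assumptions rather than facts taken from the module: a vocabulary of exactly 65536 entries is treated as still fitting in 2 bytes (IDs run from 0 to 65535), and tokens are packed in little-endian order.

```python
import struct

UINT16_CAPACITY = 2**16  # a uint16 can hold IDs 0..65535


def token_size_for(vocab_size: int) -> int:
    """Bytes per token: 2 when every ID fits in a uint16, otherwise 4."""
    return 2 if vocab_size <= UINT16_CAPACITY else 4


def token_format_for(vocab_size: int) -> str:
    """struct format character matching token_size_for."""
    return "H" if token_size_for(vocab_size) == 2 else "I"


def pack_tokens(ids: list[int], vocab_size: int) -> bytes:
    """Serialize IDs as little-endian unsigned integers of the chosen width."""
    fmt = f"<{len(ids)}{token_format_for(vocab_size)}"
    return struct.pack(fmt, *ids)


small = pack_tokens([1, 2, 3], vocab_size=50_257)  # 2 bytes per token
large = pack_tokens([1, 2, 70_000], vocab_size=100_000)  # 4 bytes per token
print(len(small), len(large))
```

Choosing the narrower width halves the size of the serialized token stream for tokenizers with small vocabularies, which is the practical reason to derive the format from the vocabulary size instead of always using 4 bytes.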
Usage
Use load_tokenizer when you need to load a HuggingFace Tokenizer from either a local file or a pretrained name in a single call. Subclass PipelineStepWithTokenizer when building pipeline steps that require tokenization, such as document tokenizers, token counters, or any processing step that converts text to token IDs.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/utils/tokenization.py
- Lines: 1-99
Signature
```python
def load_tokenizer(name_or_path: str) -> "Tokenizer": ...

def chunk_doc_ends(doc_ends, shuffle_chunk_size) -> list[list[int]]: ...

class PipelineStepWithTokenizer(PipelineStep, ABC):
    _requires_dependencies = ["tokenizers"]

    def __init__(
        self,
        tokenizer_name_or_path: str | None = None,
        eos_token: str | None = None,
        post_processor: Optional[TemplateProcessing] = None,
    ): ...

    @cached_property
    def token_size(self) -> int: ...

    @cached_property
    def token_format(self) -> str: ...

    @cached_property
    def tokenizer(self) -> "Tokenizer": ...
```
Import
```python
from datatrove.utils.tokenization import load_tokenizer, PipelineStepWithTokenizer, chunk_doc_ends
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| name_or_path | str | Yes | A local file path or Hugging Face Hub model name for the tokenizer (load_tokenizer) |
| tokenizer_name_or_path | str or None | No | Path or name of the tokenizer to load (PipelineStepWithTokenizer) |
| eos_token | str or None | No | End-of-sequence token string to append via post-processing |
| post_processor | TemplateProcessing or None | No | Custom post-processor to apply to the tokenizer |
| doc_ends | list[int] | Yes | List of document end positions in token space (chunk_doc_ends) |
| shuffle_chunk_size | int | Yes | Size of each chunk for partitioning document ends (chunk_doc_ends) |
Outputs
| Name | Type | Description |
|---|---|---|
| tokenizer | Tokenizer | A loaded and optionally configured HuggingFace Tokenizer instance |
| token_size | int | Byte width per token: 4 if vocabulary exceeds uint16 max, otherwise 2 |
| token_format | str | Struct format character: "I" for 4-byte tokens, "H" for 2-byte tokens |
| all_chunks_doc_ends | list[list[int]] | Partitioned document end positions grouped into fixed-size chunks |
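
The shape of the chunk_doc_ends output is easier to see on a small input. The function below is a sketch of one plausible reading of "fixed-size chunks", not the library's code: `chunk_doc_ends_sketch` is a hypothetical name, and it assumes that `doc_ends` is sorted and that each chunk covers a window of `shuffle_chunk_size` tokens.

```python
def chunk_doc_ends_sketch(doc_ends: list[int], shuffle_chunk_size: int) -> list[list[int]]:
    """Group sorted document end offsets by the fixed-size token window they fall in."""
    all_chunks: list[list[int]] = []
    i = 0
    last_end = doc_ends[-1] if doc_ends else 0
    # Walk over the end offset of every full window: size, 2*size, 3*size, ...
    for window_end in range(shuffle_chunk_size, last_end + 1, shuffle_chunk_size):
        current: list[int] = []
        # Collect every document that ends inside the current window.
        while i < len(doc_ends) and doc_ends[i] <= window_end:
            current.append(doc_ends[i])
            i += 1
        all_chunks.append(current)
    return all_chunks


chunks = chunk_doc_ends_sketch([3, 5, 9, 12, 20], shuffle_chunk_size=10)
print(chunks)
```

In this sketch, a document that ends past the last full window is left out of the result, and a window that contains no document end yields an empty list. Whether the library treats the tail the same way should be checked against the source.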
Usage Examples
Loading a Tokenizer
```python
from datatrove.utils.tokenization import load_tokenizer

# Load from Hugging Face Hub
tokenizer = load_tokenizer("gpt2")

# Load from a local file
tokenizer = load_tokenizer("/path/to/tokenizer.json")

encoded = tokenizer.encode("Hello, world!")
print(encoded.ids)
```
Subclassing PipelineStepWithTokenizer
```python
from datatrove.data import DocumentsPipeline
from datatrove.utils.tokenization import PipelineStepWithTokenizer


class MyTokenStep(PipelineStepWithTokenizer):
    def __init__(self, tokenizer_name: str):
        # Assumes this EOS token is in the tokenizer's vocabulary (it is for gpt2).
        super().__init__(tokenizer_name_or_path=tokenizer_name, eos_token="<|endoftext|>")

    def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1):
        for doc in data:
            # encode() returns an Encoding object; .ids holds the token IDs
            token_ids = self.tokenizer.encode(doc.text).ids
            # process token_ids ...
            yield doc
```