
Implementation:Huggingface Datatrove WordTokenizers

From Leeroopedia
Revision as of 13:02, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Datatrove_WordTokenizers.md)
Knowledge Sources
Domains: NLP, Text Processing
Last Updated: 2026-02-14 17:00 GMT

Overview

Provides a multilingual word and sentence tokenization framework with language-specific tokenizer implementations and a CSV-driven assignment system mapping languages to their appropriate tokenizer.

Description

The WordTokenizers module implements a strategy pattern for multilingual tokenization. The abstract base class WordTokenizer defines three methods: word_tokenize (split text into words), sent_tokenize (split text into sentences), and span_tokenize (return character offset spans for sentences). Eleven concrete implementations, listed under Signature, wrap various NLP libraries to handle different language families.
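The strategy pattern described above can be sketched in a few lines. This is a minimal illustration, not the module's actual code: the base class below is a stand-in that mirrors the three documented methods, and ToyTokenizer and count_words are hypothetical names invented for this example.

```python
from __future__ import annotations

import re
from abc import ABC, abstractmethod


class WordTokenizer(ABC):
    """Stand-in for the abstract base class (illustration only)."""

    def __init__(self, language: str | None = None):
        self.language = language

    @abstractmethod
    def word_tokenize(self, text: str) -> list[str]: ...

    @abstractmethod
    def sent_tokenize(self, text: str) -> list[str]: ...

    @abstractmethod
    def span_tokenize(self, text: str) -> list[tuple[int, int]]: ...


class ToyTokenizer(WordTokenizer):
    """Hypothetical strategy: whitespace words, sentences ending in . ! ?"""

    # A sentence starts at a non-space character and runs to its terminator.
    _SENTENCE = re.compile(r"\S[^.!?]*[.!?]*")

    def word_tokenize(self, text: str) -> list[str]:
        return text.split()

    def span_tokenize(self, text: str) -> list[tuple[int, int]]:
        return [m.span() for m in self._SENTENCE.finditer(text)]

    def sent_tokenize(self, text: str) -> list[str]:
        return [text[start:end] for start, end in self.span_tokenize(text)]


def count_words(text: str, tokenizer: WordTokenizer) -> int:
    # Callers depend only on the interface, so any strategy can be swapped in.
    return len(tokenizer.word_tokenize(text))
```

Because count_words only relies on the shared interface, replacing ToyTokenizer with another subclass requires no change to the calling code, which is the point of the pattern.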

SpaCyTokenizer handles most European languages as well as Chinese and Japanese, using spaCy's blank models with a sentencizer pipe. It includes special handling for Vietnamese (pyvi segmenter), Chinese (jieba segmenter), and Japanese (a custom tokenizer, registered as datatrove.ja.JapaneseTokenizer, that works around a known spaCy memory leak). spaCy's memory_zone context manager is used to prevent memory leaks. NLTKTokenizer wraps NLTK's punkt tokenizer for languages that have punkt models. StanzaTokenizer uses the Stanza NLP library for languages not well served by other tools. Specialized tokenizers cover Thai (ThaiTokenizer, using PyThaiNLP), Korean (KiwiTokenizer, using kiwipiepy), Khmer (KhmerTokenizer), Lao (LaoTokenizer), Tibetan (TibetanTokenizer, using botok), Burmese (BurmeseTokenizer, using pyidaungsu), and Indic languages (IndicNLPTokenizer, using IndicNLP). WhitespaceTokenizer serves as a regex-based fallback for unsupported languages.
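To make the idea of a regex-based fallback concrete, here is a minimal sketch using only the standard library. The pattern and the function name fallback_word_tokenize are assumptions chosen for illustration; the actual WhitespaceTokenizer may use a different expression.

```python
import re

# Assumed pattern: a run of word characters, or a single non-space symbol.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def fallback_word_tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens without any NLP library."""
    return _TOKEN.findall(text)


print(fallback_word_tokenize("Hello world, how are you?"))
# ['Hello', 'world', ',', 'how', 'are', 'you', '?']
```

A fallback like this needs no language model, which is why it suits languages that none of the specialized libraries cover, at the cost of ignoring language-specific rules.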

The load_tokenizer_assignments function reads a CSV configuration file that maps ISO language codes and script combinations to tokenizer classes, enabling the system to automatically select the best tokenizer for any of hundreds of supported languages. The load_word_tokenizer function is the main entry point, accepting either a language code string or a pre-instantiated WordTokenizer instance.
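A CSV-driven assignment of this kind might look roughly like the sketch below. The column names (lang, script, tokenizer), the sample rows, and the helpers load_assignments and resolve are all hypothetical; the real configuration file and load_tokenizer_assignments may be organized differently.

```python
import csv
import io

# Hypothetical CSV layout and rows, for illustration only.
SAMPLE_CSV = """lang,script,tokenizer
eng,Latn,SpaCyTokenizer
tha,Thai,ThaiTokenizer
kor,Hang,KiwiTokenizer
"""


def load_assignments(csv_text: str) -> dict:
    """Map (language, script) pairs to tokenizer class names."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table[(row["lang"], row["script"])] = row["tokenizer"]
        # Also register the bare language code, keeping the first match.
        table.setdefault((row["lang"], None), row["tokenizer"])
    return table


def resolve(code: str, table: dict, default: str = "WhitespaceTokenizer") -> str:
    """Look up 'eng' or 'eng_Latn' style codes, falling back to the default."""
    lang, _, script = code.partition("_")
    return table.get((lang, script or None), default)


table = load_assignments(SAMPLE_CSV)
print(resolve("tha_Thai", table))  # ThaiTokenizer
print(resolve("xyz", table))       # WhitespaceTokenizer
```

Keeping the mapping in a data file rather than in code means a language can be added or reassigned without touching the tokenizer classes.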

Usage

Use this module when you need to tokenize text into words or sentences for any language. Call load_word_tokenizer with a language code to get the appropriate tokenizer, then use word_tokenize, sent_tokenize, or span_tokenize as needed.
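Since the entry point accepts either a language code or a ready-made tokenizer, callers can write functions that take both. The sketch below shows one plausible way such a dispatch could work; the stand-in class, the resolve_tokenizer name, the factory parameter, and the TypeError for other input types are assumptions, not the library's confirmed behavior.

```python
from __future__ import annotations

from typing import Callable


class WordTokenizer:
    """Stand-in for the real base class (illustration only)."""

    def __init__(self, language: str | None = None):
        self.language = language


def resolve_tokenizer(
    language_or_tok: str | WordTokenizer,
    factory: Callable[[str], WordTokenizer] = WordTokenizer,
) -> WordTokenizer:
    # A pre-instantiated tokenizer is passed through unchanged.
    if isinstance(language_or_tok, WordTokenizer):
        return language_or_tok
    # A string is treated as a language code and handed to the factory.
    if isinstance(language_or_tok, str):
        return factory(language_or_tok)
    raise TypeError(f"expected str or WordTokenizer, got {type(language_or_tok).__name__}")
```

Accepting both forms lets pipeline components expose a single parameter while still allowing a preconfigured tokenizer to be injected.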

Code Reference

Source Location

Signature

class WordTokenizer(ABC):
    def __init__(self, language: str | None = None): ...
    def word_tokenize(self, text: str) -> list[str]: ...
    def sent_tokenize(self, text: str) -> list[str]: ...
    def span_tokenize(self, text: str) -> list[tuple[int, int]]: ...

class SpaCyTokenizer(WordTokenizer): ...
class NLTKTokenizer(WordTokenizer): ...
class StanzaTokenizer(WordTokenizer): ...
class ThaiTokenizer(WordTokenizer): ...
class IndicNLPTokenizer(WordTokenizer): ...
class KiwiTokenizer(WordTokenizer): ...
class KhmerTokenizer(WordTokenizer): ...
class LaoTokenizer(WordTokenizer): ...
class TibetanTokenizer(WordTokenizer): ...
class WhitespaceTokenizer(WordTokenizer): ...
class BurmeseTokenizer(WhitespaceTokenizer): ...

def load_word_tokenizer(language_or_tok: str | WordTokenizer) -> WordTokenizer: ...

Import

from datatrove.utils.word_tokenizers import load_word_tokenizer, WordTokenizer

I/O Contract

Inputs

Name | Type | Required | Description
language_or_tok | str or WordTokenizer | Yes | ISO language code (e.g., "eng", "fra_Latn") or a pre-instantiated WordTokenizer
text | str | Yes | The text string to tokenize

Outputs

Name | Type | Description
words | list[str] | List of word tokens (from word_tokenize)
sentences | list[str] | List of sentence strings (from sent_tokenize)
spans | list[tuple[int, int]] | List of (start, end) character offsets for sentence boundaries (from span_tokenize)
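The relationship between spans and sentences can be checked with plain slicing. This sketch assumes the offsets are half-open, as in Python slices, which is what the span values shown in the usage example on this page suggest; no tokenizer is called here.

```python
text = "First sentence. Second sentence."
spans = [(0, 15), (16, 32)]  # values taken from the usage example on this page

# Each (start, end) pair selects one sentence; end is exclusive.
sentences = [text[start:end] for start, end in spans]
print(sentences)  # ['First sentence.', 'Second sentence.']

# The character between the two spans is the separating space.
gap = text[spans[0][1]:spans[1][0]]
print(repr(gap))  # ' '
```

Working with offsets rather than strings is useful when sentence boundaries must be mapped back onto the original document, for example to annotate or filter it in place.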

Usage Examples

Basic Usage

from datatrove.utils.word_tokenizers import load_word_tokenizer

# Load tokenizer for English
tokenizer = load_word_tokenizer("eng")

# Word tokenization
words = tokenizer.word_tokenize("Hello world, how are you?")
# Result: ["Hello", "world", ",", "how", "are", "you", "?"]

# Sentence tokenization
sentences = tokenizer.sent_tokenize("First sentence. Second sentence.")
# Result: ["First sentence.", "Second sentence."]

# Span tokenization (character offsets)
spans = tokenizer.span_tokenize("First sentence. Second sentence.")
# Result: [(0, 15), (16, 32)]

# Pass a pre-instantiated tokenizer instead of a language code
# (SpaCyTokenizer is constructed here with the code "fr")
from datatrove.utils.word_tokenizers import SpaCyTokenizer
custom_tok = load_word_tokenizer(SpaCyTokenizer("fr"))
