Implementation:Huggingface Datatrove WordTokenizers
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Provides a multilingual word and sentence tokenization framework with language-specific tokenizer implementations and a CSV-driven assignment system mapping languages to their appropriate tokenizer.
Description
The WordTokenizers module implements a strategy pattern for multilingual tokenization. The abstract base class WordTokenizer defines three methods: word_tokenize (split text into words), sent_tokenize (split text into sentences), and span_tokenize (return character offset spans for sentences). Eleven concrete implementations, listed under Signature, wrap various NLP libraries to handle different language families.
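The strategy pattern described above can be illustrated with a minimal, standard-library-only sketch. The names SketchTokenizer and RegexSketchTokenizer are hypothetical stand-ins, not classes from datatrove, and the regular expressions are assumptions chosen for illustration rather than the rules the library uses.

```python
import re
from abc import ABC, abstractmethod


class SketchTokenizer(ABC):
    """Hypothetical stand-in mirroring the three-method interface."""

    @abstractmethod
    def word_tokenize(self, text: str) -> list[str]: ...

    @abstractmethod
    def sent_tokenize(self, text: str) -> list[str]: ...

    @abstractmethod
    def span_tokenize(self, text: str) -> list[tuple[int, int]]: ...


class RegexSketchTokenizer(SketchTokenizer):
    """One interchangeable strategy: purely regex-based splitting."""

    # A word is a run of word characters or one punctuation mark.
    _WORD = re.compile(r"\w+|[^\w\s]")
    # A sentence is a run of non-terminators plus trailing terminators.
    _SENT = re.compile(r"[^.!?]+[.!?]*")

    def word_tokenize(self, text: str) -> list[str]:
        return self._WORD.findall(text)

    def span_tokenize(self, text: str) -> list[tuple[int, int]]:
        spans = []
        for match in self._SENT.finditer(text):
            chunk = match.group()
            stripped = chunk.strip()
            if not stripped:
                continue
            # Skip leading whitespace so the span starts on the sentence.
            start = match.start() + len(chunk) - len(chunk.lstrip())
            spans.append((start, start + len(stripped)))
        return spans

    def sent_tokenize(self, text: str) -> list[str]:
        # Sentences are the slices of the text addressed by the spans.
        return [text[s:e] for s, e in self.span_tokenize(text)]


tok = RegexSketchTokenizer()
print(tok.word_tokenize("Hello world, how are you?"))
print(tok.span_tokenize("First sentence. Second sentence."))
```

Because callers depend only on the three abstract methods, a library-backed strategy can replace the regex one without changing calling code.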
SpaCyTokenizer handles most European languages as well as Chinese and Japanese, using spaCy's blank models with a sentencizer pipe. It includes special handling for Vietnamese (pyvi segmenter), Chinese (jieba segmenter), and Japanese (a custom tokenizer, registered as datatrove.ja.JapaneseTokenizer, that works around a known spaCy memory leak). Tokenization runs inside spaCy's memory_zone context manager to prevent memory leaks.
NLTKTokenizer wraps NLTK's punkt tokenizer for languages with punkt models. StanzaTokenizer uses the Stanza NLP library for languages not well served by other tools. Specialized tokenizers are provided for Thai (ThaiTokenizer using PyThaiNLP), Korean (KiwiTokenizer using kiwipiepy), Khmer (KhmerTokenizer), Lao (LaoTokenizer), Tibetan (TibetanTokenizer using botok), Burmese (BurmeseTokenizer using pyidaungsu), and Indic languages (IndicNLPTokenizer using IndicNLP). WhitespaceTokenizer serves as a regex-based fallback for unsupported languages.
The load_tokenizer_assignments function reads a CSV configuration file that maps combinations of ISO language code and script to tokenizer classes, so the system can select a suitable tokenizer automatically for each of the hundreds of supported languages. The load_word_tokenizer function is the main entry point; it accepts either a language code string or a pre-instantiated WordTokenizer instance.
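As a rough illustration of a CSV-driven assignment table, the sketch below parses an inline CSV into a lookup keyed by language code. The column names (code_3, script, tok_class, tok_code), the sample rows, and the helpers load_assignments and resolve are all hypothetical; the real file layout and the behavior of load_tokenizer_assignments may differ. Raising ValueError for unknown codes is likewise a choice made for this sketch.

```python
import csv
import io

# Hypothetical CSV layout; the real assignment file may use other columns.
ASSIGNMENTS_CSV = """code_3,script,tok_class,tok_code
eng,Latn,SpaCyTokenizer,en
fra,Latn,SpaCyTokenizer,fr
kor,Hang,KiwiTokenizer,ko
tha,Thai,ThaiTokenizer,
"""


def load_assignments(csv_text):
    """Build a dict mapping 'fra' and 'fra_Latn' style keys to an assignment."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Empty tok_code cells become None (tokenizer takes no argument).
        entry = (row["tok_class"], row["tok_code"] or None)
        table[f"{row['code_3']}_{row['script']}"] = entry
        # The first row seen for a bare code becomes its default.
        table.setdefault(row["code_3"], entry)
    return table


def resolve(language, table):
    """Return (class name, argument) for a code, or fail loudly."""
    if language not in table:
        raise ValueError(f"No tokenizer assigned for language '{language}'")
    return table[language]


table = load_assignments(ASSIGNMENTS_CSV)
print(resolve("fra_Latn", table))
print(resolve("kor", table))
```

Keeping the mapping in data rather than code means a new language can be supported by adding a row, without touching the tokenizer classes.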
Usage
Use this module when you need to tokenize text into words or sentences for any language. Call load_word_tokenizer with a language code to get the appropriate tokenizer, then use word_tokenize, sent_tokenize, or span_tokenize as needed.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/utils/word_tokenizers.py
- Lines: 1-494
Signature
class WordTokenizer(ABC):
    def __init__(self, language: str | None = None): ...
    def word_tokenize(self, text: str) -> list[str]: ...
    def sent_tokenize(self, text: str) -> list[str]: ...
    def span_tokenize(self, text: str) -> list[tuple[int, int]]: ...
class SpaCyTokenizer(WordTokenizer): ...
class NLTKTokenizer(WordTokenizer): ...
class StanzaTokenizer(WordTokenizer): ...
class ThaiTokenizer(WordTokenizer): ...
class IndicNLPTokenizer(WordTokenizer): ...
class KiwiTokenizer(WordTokenizer): ...
class KhmerTokenizer(WordTokenizer): ...
class LaoTokenizer(WordTokenizer): ...
class TibetanTokenizer(WordTokenizer): ...
class WhitespaceTokenizer(WordTokenizer): ...
class BurmeseTokenizer(WhitespaceTokenizer): ...
def load_word_tokenizer(language_or_tok: str | WordTokenizer) -> WordTokenizer: ...
Import
from datatrove.utils.word_tokenizers import load_word_tokenizer, WordTokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| language_or_tok | str or WordTokenizer | Yes | ISO language code (e.g., "eng", "fra_Latn") or a pre-instantiated WordTokenizer |
| text | str | Yes | The text string to tokenize |
Outputs
| Name | Type | Description |
|---|---|---|
| words | list[str] | List of word tokens (from word_tokenize) |
| sentences | list[str] | List of sentence strings (from sent_tokenize) |
| spans | list[tuple[int, int]] | List of (start, end) character offsets for sentence boundaries (from span_tokenize) |
Usage Examples
Basic Usage
from datatrove.utils.word_tokenizers import load_word_tokenizer
# Load tokenizer for English
tokenizer = load_word_tokenizer("eng")
# Word tokenization
words = tokenizer.word_tokenize("Hello world, how are you?")
# Result: ["Hello", "world", ",", "how", "are", "you", "?"]
# Sentence tokenization
sentences = tokenizer.sent_tokenize("First sentence. Second sentence.")
# Result: ["First sentence.", "Second sentence."]
# Span tokenization (character offsets)
spans = tokenizer.span_tokenize("First sentence. Second sentence.")
# Result: [(0, 15), (16, 32)]
# Pass a pre-instantiated tokenizer instead of a language code
from datatrove.utils.word_tokenizers import SpaCyTokenizer
custom_tok = load_word_tokenizer(SpaCyTokenizer("fr"))
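The relationship between span_tokenize and sent_tokenize in the example above can be checked without the library: the spans shown behave as (start, end) offsets with an exclusive end, so slicing the input with them recovers the sentences. The span values below are copied from the example rather than computed, so this is only a consistency check on those numbers.

```python
text = "First sentence. Second sentence."
spans = [(0, 15), (16, 32)]  # copied from the span_tokenize example

# Each span is a half-open range usable directly as a slice.
sentences = [text[start:end] for start, end in spans]
print(sentences)  # ['First sentence.', 'Second sentence.']

# The character between the two spans is the separating space.
gap = text[spans[0][1]:spans[1][0]]
print(repr(gap))  # ' '
```

Working with spans instead of sentence strings keeps the original offsets, which is useful when results must be mapped back onto the source text.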