Implementation:Huggingface Datatrove JapaneseTokenizer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Tokenization |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
JapaneseTokenizer is a custom spaCy tokenizer for Japanese text that fixes a memory leak in spaCy's built-in Japanese tokenizer (spaCy issue #13684), using SudachiPy for morphological analysis.
Description
This module provides a memory-safe replacement for spaCy's built-in Japanese tokenizer. The core class, JapaneseTokenizer, extends spaCy's DummyTokenizer and uses SudachiPy for morphological analysis of Japanese text. The key fix is the deliberate omission of the token.morph = MorphAnalysis(...) assignment, which is the source of the memory leak in the upstream spaCy implementation.
The tokenizer converts SudachiPy morphemes into DetailedToken named tuples containing surface form, POS tag, inflection, lemma, normalized form, reading, and optional sub-tokens. It handles POS resolution through a multi-level lookup: first checking orthography-based tag maps (TAG_ORTH_MAP), then tag bigram maps (TAG_BIGRAM_MAP), and finally falling back to unigram tag maps (TAG_MAP). This implements the Universal Dependencies POS mapping rules where some POS tags depend on context.
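The three-level lookup described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the map contents are toy entries chosen for the example, POS values are plain strings rather than spaCy symbols, and the function name and signature `resolve_pos(orth, tag, next_tag)` are assumptions.

```python
from typing import Dict, Optional, Tuple

# Toy stand-ins for the real maps; all entries are illustrative assumptions.
TAG_ORTH_MAP: Dict[str, Dict[str, str]] = {
    "連体詞": {"あの": "DET"},
}
TAG_BIGRAM_MAP: Dict[Tuple[str, str], Tuple[Optional[str], Optional[str]]] = {
    ("名詞-普通名詞-サ変可能", "動詞-非自立可能"): ("VERB", "AUX"),
}
TAG_MAP: Dict[str, str] = {
    "連体詞": "ADJ",
    "名詞-普通名詞-サ変可能": "NOUN",
    "動詞-非自立可能": "VERB",
}


def resolve_pos(
    orth: str, tag: str, next_tag: Optional[str]
) -> Tuple[str, Optional[str]]:
    """Return (pos_of_this_token, pos_forced_on_next_token_or_None)."""
    # 1. Orthography-specific override: the surface form decides the POS.
    orth_map = TAG_ORTH_MAP.get(tag)
    if orth_map and orth in orth_map:
        return orth_map[orth], None

    # 2. Context-dependent override: the (tag, next_tag) pair decides the POS
    #    of this token and, optionally, of the following token.
    if next_tag is not None:
        pair = TAG_BIGRAM_MAP.get((tag, next_tag))
        if pair is not None:
            pos, next_pos = pair
            # A None entry means "no override for this token": use the unigram map.
            return (pos if pos is not None else TAG_MAP[tag]), next_pos

    # 3. Fallback: plain unigram tag-to-POS mapping.
    return TAG_MAP[tag], None
```

With these toy maps, a verbal noun followed by a light verb resolves to `("VERB", "AUX")`, while the same tag at the end of a sentence falls through to the unigram map and resolves to `("NOUN", None)`.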
The module supports SudachiPy's three split modes, which control segmentation granularity: A yields the shortest units, B yields middle-length units, and C yields the longest units (closest to named entities). In modes B and C, sub-token information is stored in doc.user_data["sub_tokens"]. The tokenizer also merges runs of consecutive space tokens produced by SudachiPy's internal normalization.
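As a rough sketch of the data involved, the snippet below models a DetailedToken-like named tuple and one possible way to collapse a run of consecutive space tokens so that only the first survives. The field names, the `"空白"` whitespace tag, and the helper `drop_repeated_spaces` are assumptions made for illustration; the module's real implementation may differ.

```python
from typing import List, NamedTuple, Optional


class DetailedToken(NamedTuple):
    """Illustrative token record; field names are assumed, not confirmed."""
    surface: str                       # surface form as it appears in the text
    tag: str                           # POS tag from the morphological analyzer
    inf: str                           # inflection information
    lemma: str                         # dictionary form
    norm: str                          # normalized form
    reading: str                       # reading (kana)
    sub_tokens: Optional[list] = None  # finer splits, when a split mode provides them


SPACE_TAG = "空白"  # assumed tag used for whitespace tokens


def _is_space(token: DetailedToken) -> bool:
    return token.surface.isspace() and token.tag == SPACE_TAG


def drop_repeated_spaces(tokens: List[DetailedToken]) -> List[DetailedToken]:
    """Keep the first token of every run of space tokens, drop the rest."""
    kept: List[DetailedToken] = []
    for idx, token in enumerate(tokens):
        if idx > 0 and _is_space(token) and _is_space(tokens[idx - 1]):
            continue  # part of a run that is already represented
        kept.append(token)
    return kept


space = DetailedToken(" ", SPACE_TAG, "", " ", " ", "")
tokens = [
    DetailedToken("東京", "名詞-固有名詞", "", "東京", "東京", "トウキョウ"),
    space,
    space,
    DetailedToken("日本", "名詞-固有名詞", "", "日本", "日本", "ニッポン"),
]
print([t.surface for t in drop_repeated_spaces(tokens)])  # ['東京', ' ', '日本']
```

Keeping a single placeholder per run leaves the caller free to realign that placeholder with the original text and recover the full whitespace span.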
The Japanese language class and JapaneseDefaults provide the spaCy language integration, and the tokenizer is registered via @registry.tokenizers("datatrove.ja.JapaneseTokenizer") for use in spaCy pipelines.
Usage
Use this tokenizer for Japanese text processing within datatrove pipelines, particularly for sentence-level operations like sentence deduplication, where memory-safe tokenization is critical for long-running batch processing jobs.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/utils/japanese_tokenizer.py
- Lines: 1-310
Signature
@registry.tokenizers("datatrove.ja.JapaneseTokenizer")
def create_tokenizer(split_mode: Optional[str] = None):

class JapaneseTokenizer(DummyTokenizer):
    def __init__(self, vocab: Vocab, split_mode: Optional[str] = None) -> None:
    def __call__(self, text: str) -> Doc:

class Japanese(Language):
    lang = "ja"
    Defaults = JapaneseDefaults
Import
from datatrove.utils.japanese_tokenizer import Japanese, JapaneseTokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| vocab | Vocab | Yes | spaCy Vocab object for the language |
| split_mode | Optional[str] | No | SudachiPy split mode: "A", "B", or "C"; when omitted (None), mode "A" is used |
| text | str | Yes (for __call__) | Japanese text to tokenize |
Outputs
| Name | Type | Description |
|---|---|---|
| Doc | spacy.tokens.Doc | spaCy Doc with tokens, POS tags, lemmas, and norms populated |
Usage Examples
Basic Usage
from datatrove.utils.japanese_tokenizer import Japanese
# Create a Japanese NLP pipeline with the memory-safe tokenizer
nlp = Japanese()
doc = nlp("東京は日本の首都です。")
for token in doc:
    print(token.text, token.pos_, token.tag_, token.lemma_)
With Split Mode
from datatrove.utils.japanese_tokenizer import Japanese, JapaneseTokenizer
# Use split mode B (middle-length units; the default mode A yields the shortest units)
nlp = Japanese()
nlp.tokenizer = JapaneseTokenizer(nlp.vocab, split_mode="B")
doc = nlp("国際連合教育科学文化機関")
for token in doc:
    print(token.text, token.tag_)
# In modes B and C, sub-token information is stored in doc.user_data
print(doc.user_data["sub_tokens"])