
Implementation:Huggingface Datatrove JapaneseTokenizer

From Leeroopedia
Knowledge Sources
Domains NLP, Tokenization
Last Updated 2026-02-14 17:00 GMT

Overview

JapaneseTokenizer is a custom spaCy tokenizer for Japanese text. It uses SudachiPy for morphological analysis and avoids a memory leak in spaCy's built-in Japanese tokenizer (spaCy issue #13684).

Description

This module provides a memory-safe replacement for spaCy's built-in Japanese tokenizer. The core class, JapaneseTokenizer, extends spaCy's DummyTokenizer and uses SudachiPy for morphological analysis of Japanese text. The key fix is the deliberate omission of the token.morph = MorphAnalysis(...) assignment, which is the source of the memory leak in the upstream spaCy implementation.

The tokenizer converts SudachiPy morphemes into DetailedToken named tuples containing surface form, POS tag, inflection, lemma, normalized form, reading, and optional sub-tokens. It handles POS resolution through a multi-level lookup: first checking orthography-based tag maps (TAG_ORTH_MAP), then tag bigram maps (TAG_BIGRAM_MAP), and finally falling back to unigram tag maps (TAG_MAP). This implements the Universal Dependencies POS mapping rules where some POS tags depend on context.
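The three-level lookup described above can be sketched as follows. This is a minimal illustration, not the module's code: the function name resolve_pos, its return shape, and every entry in the three tables are assumptions chosen to show the order of the checks, and the real tag maps are far larger.

```python
from typing import Dict, Optional, Tuple

# Toy stand-ins for the real tables. All entries are illustrative
# assumptions, not copied from the datatrove or spaCy tag maps.
TAG_ORTH_MAP: Dict[str, Dict[str, str]] = {
    "連体詞": {"大きな": "ADJ"},
}
TAG_BIGRAM_MAP: Dict[Tuple[str, str], Tuple[Optional[str], Optional[str]]] = {
    ("名詞-普通名詞-サ変可能", "動詞-非自立可能"): ("VERB", "AUX"),
}
TAG_MAP: Dict[str, str] = {
    "連体詞": "DET",
    "名詞-普通名詞-サ変可能": "NOUN",
    "動詞-非自立可能": "VERB",
}


def resolve_pos(
    orth: str, tag: str, next_tag: Optional[str]
) -> Tuple[str, Optional[str]]:
    """Return (pos, forced_pos_for_next_token) for one token."""
    # 1. Orthography-based override: the surface form decides the POS.
    orth_map = TAG_ORTH_MAP.get(tag)
    if orth_map is not None and orth in orth_map:
        return orth_map[orth], None

    # 2. Bigram rule: the POS depends on the tag of the following token.
    if next_tag is not None:
        bigram = (tag, next_tag)
        if bigram in TAG_BIGRAM_MAP:
            current_pos, next_pos = TAG_BIGRAM_MAP[bigram]
            if current_pos is None:
                # The rule only constrains the next token.
                return TAG_MAP[tag], next_pos
            return current_pos, next_pos

    # 3. Fallback: plain unigram mapping.
    return TAG_MAP[tag], None
```

With these toy tables, the same tag resolves differently depending on its surface form or on the tag that follows it, which is the context dependence the paragraph refers to.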

The module supports SudachiPy's three split modes, A (shortest units), B (intermediate units), and C (longest units), for varying levels of morphological decomposition. In modes B and C, sub-token information is stored in doc.user_data["sub_tokens"]. The tokenizer also merges runs of consecutive space tokens produced by SudachiPy's internal normalization.
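As a rough illustration of the space-merging step, the sketch below collapses each run of whitespace-only tokens into a single token. The DetailedToken field names and the helper merge_space_tokens are assumptions made for this example. The description above does not say whether the module concatenates the surfaces of a run or keeps only its first token, so the concatenation here is one plausible reading only.

```python
from typing import List, NamedTuple, Optional


class DetailedToken(NamedTuple):
    # Field names are assumed; they mirror the fields listed in the text.
    surface: str
    tag: str
    inf: str
    lemma: str
    norm: str
    reading: str
    sub_tokens: Optional[List] = None


def merge_space_tokens(tokens: List[DetailedToken]) -> List[DetailedToken]:
    """Collapse each run of whitespace-only tokens into one token."""
    merged: List[DetailedToken] = []
    for tok in tokens:
        prev = merged[-1] if merged else None
        if prev is not None and tok.surface.isspace() and prev.surface.isspace():
            # Extend the previous space token instead of emitting a new one.
            merged[-1] = prev._replace(surface=prev.surface + tok.surface)
        else:
            merged.append(tok)
    return merged
```

Merging this way keeps the concatenated surfaces equal to the original text, which matters when token offsets are later aligned back to the input string.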

The Japanese language class and JapaneseDefaults provide the spaCy language integration, and the tokenizer is registered via @registry.tokenizers("datatrove.ja.JapaneseTokenizer") for use in spaCy pipelines.
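Because the tokenizer is registered under a name, it can in principle be selected through a spaCy config override rather than by constructing the class by hand. The sketch below assumes that datatrove's Japanese class works with spaCy's standard Language.from_config in the same way as spaCy's own Japanese class; this has not been verified against datatrove. The helpers tokenizer_config and build_pipeline are hypothetical names introduced for this example.

```python
from typing import Any, Dict

# Registered name taken from the @registry.tokenizers decorator above.
TOKENIZER_NAME = "datatrove.ja.JapaneseTokenizer"


def tokenizer_config(split_mode: str = "A") -> Dict[str, Any]:
    """Build an [nlp.tokenizer] override as a nested dict (hypothetical helper)."""
    return {
        "nlp": {
            "tokenizer": {
                "@tokenizers": TOKENIZER_NAME,
                "split_mode": split_mode,
            }
        }
    }


def build_pipeline(split_mode: str = "A"):
    """Create a pipeline from the override (requires datatrove, spaCy, SudachiPy)."""
    # Imported lazily so tokenizer_config works without datatrove installed.
    # Importing the module is also what runs the registration decorator.
    from datatrove.utils.japanese_tokenizer import Japanese

    # from_config merges the partial override with the class defaults.
    return Japanese.from_config(tokenizer_config(split_mode))
```

Calling build_pipeline("B") would then be expected to return a pipeline whose tokenizer runs in split mode B, assuming the dependencies are installed.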

Usage

Use this tokenizer for Japanese text processing within datatrove pipelines, particularly for sentence-level operations like sentence deduplication, where memory-safe tokenization is critical for long-running batch processing jobs.

Code Reference

Source Location

Signature

@registry.tokenizers("datatrove.ja.JapaneseTokenizer")
def create_tokenizer(split_mode: Optional[str] = None):

class JapaneseTokenizer(DummyTokenizer):
    def __init__(self, vocab: Vocab, split_mode: Optional[str] = None) -> None:
    def __call__(self, text: str) -> Doc:

class Japanese(Language):
    lang = "ja"
    Defaults = JapaneseDefaults

Import

from datatrove.utils.japanese_tokenizer import Japanese, JapaneseTokenizer

I/O Contract

Inputs

Name | Type | Required | Description
vocab | Vocab | Yes | spaCy Vocab object for the language
split_mode | Optional[str] | No | SudachiPy split mode: "A", "B", or "C"; defaults to None, which is treated as "A"
text | str | Yes (for __call__) | Japanese text to tokenize

Outputs

Name | Type | Description
Doc | spacy.tokens.Doc | spaCy Doc with tokens, POS tags, lemmas, and norms populated

Usage Examples

Basic Usage

from datatrove.utils.japanese_tokenizer import Japanese

# Create a Japanese NLP pipeline with the memory-safe tokenizer
nlp = Japanese()
doc = nlp("東京は日本の首都です。")

for token in doc:
    print(token.text, token.pos_, token.tag_, token.lemma_)

With Split Mode

from datatrove.utils.japanese_tokenizer import Japanese, JapaneseTokenizer

# Split mode B yields intermediate-length units, coarser than the default mode A
nlp = Japanese()
# Replace the default tokenizer with one configured for split mode B
nlp.tokenizer = JapaneseTokenizer(nlp.vocab, split_mode="B")
doc = nlp("国際連合教育科学文化機関")

for token in doc:
    print(token.text, token.tag_)

Related Pages
