Implementation:Huggingface Datatrove JapaneseTokenizer

Knowledge Sources	Huggingface_Datatrove
Domains	NLP, Tokenization
Last Updated	2026-02-14 17:00 GMT

Overview

JapaneseTokenizer is a custom spaCy tokenizer for Japanese text. It uses SudachiPy for morphological analysis and avoids a memory leak present in spaCy's built-in Japanese tokenizer (spaCy issue #13684).

Description

This module provides a memory-safe replacement for spaCy's built-in Japanese tokenizer. The core class, JapaneseTokenizer, extends spaCy's DummyTokenizer and uses SudachiPy for morphological analysis of Japanese text. The key fix is the deliberate omission of the token.morph = MorphAnalysis(...) assignment, which is the source of the memory leak in the upstream spaCy implementation.

The tokenizer converts SudachiPy morphemes into DetailedToken named tuples containing surface form, POS tag, inflection, lemma, normalized form, reading, and optional sub-tokens. It handles POS resolution through a multi-level lookup: first checking orthography-based tag maps (TAG_ORTH_MAP), then tag bigram maps (TAG_BIGRAM_MAP), and finally falling back to unigram tag maps (TAG_MAP). This implements the Universal Dependencies POS mapping rules where some POS tags depend on context.
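
The lookup order described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's code: the function name resolve_pos, the (current_pos, next_pos) return shape, and every map entry are hypothetical and chosen for readability. The real TAG_ORTH_MAP, TAG_BIGRAM_MAP, and TAG_MAP are much larger.

```python
from typing import Dict, Optional, Tuple

# Toy stand-ins for the real maps; all entries are illustrative assumptions.
TAG_ORTH_MAP: Dict[str, Dict[str, str]] = {
    "連体詞": {"あの": "DET", "この": "DET"},
}
TAG_BIGRAM_MAP: Dict[Tuple[str, str], Tuple[Optional[str], Optional[str]]] = {
    # (tag, next_tag) -> (POS for this token, POS for the next token)
    ("名詞-普通名詞-サ変可能", "動詞-非自立可能"): ("VERB", "AUX"),
    # None means "no override for this token, use the unigram map"
    ("名詞-普通名詞-一般", "助動詞"): (None, "AUX"),
}
TAG_MAP: Dict[str, str] = {
    "連体詞": "ADJ",
    "名詞-普通名詞-サ変可能": "NOUN",
    "名詞-普通名詞-一般": "NOUN",
    "動詞-非自立可能": "VERB",
}


def resolve_pos(
    orth: str, tag: str, next_tag: Optional[str]
) -> Tuple[str, Optional[str]]:
    """Return (POS for this token, POS forced on the next token or None)."""
    # 1. Orthography-based lookup: the literal surface form decides the POS.
    orth_map = TAG_ORTH_MAP.get(tag)
    if orth_map and orth in orth_map:
        return orth_map[orth], None

    # 2. Tag bigram lookup: the following token's tag decides the POS.
    if next_tag is not None:
        bigram = TAG_BIGRAM_MAP.get((tag, next_tag))
        if bigram is not None:
            current_pos, next_pos = bigram
            if current_pos is None:
                current_pos = TAG_MAP[tag]
            return current_pos, next_pos

    # 3. Unigram fallback: the tag alone decides the POS.
    return TAG_MAP[tag], None
```

Trying the most specific rule first means a context-free default in the unigram map never masks an exception that depends on the surface form or on the following token.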

The module supports SudachiPy's three split modes, which control the granularity of morphological decomposition: A yields the shortest units, B yields middle-length units, and C yields the longest, named-entity-like units. In modes B and C, sub-token information is stored in doc.user_data["sub_tokens"]. The tokenizer also merges runs of consecutive space tokens produced by SudachiPy's internal normalization.
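
One way to picture the space-token merging is the sketch below. It is an assumption-laden illustration rather than the module's actual logic: the DetailedToken field names, the whitespace tag "空白", and the helper merge_space_runs are all hypothetical, and the real tokenizer may handle whitespace runs differently.

```python
from typing import List, NamedTuple, Optional


class DetailedToken(NamedTuple):
    # Field names are assumed; the source only lists what the tuple contains.
    surface: str
    tag: str
    inf: str = ""
    lemma: str = ""
    norm: str = ""
    reading: str = ""
    sub_tokens: Optional[list] = None


SPACE_TAG = "空白"  # assumed tag for whitespace morphemes


def _is_space(tok: DetailedToken) -> bool:
    return tok.tag == SPACE_TAG and tok.surface.isspace()


def merge_space_runs(tokens: List[DetailedToken]) -> List[DetailedToken]:
    """Collapse each run of adjacent whitespace tokens into a single token."""
    merged: List[DetailedToken] = []
    for tok in tokens:
        if _is_space(tok) and merged and _is_space(merged[-1]):
            # Extend the previous whitespace token instead of adding a new one.
            prev = merged[-1]
            merged[-1] = prev._replace(surface=prev.surface + tok.surface)
        else:
            merged.append(tok)
    return merged
```

Collapsing the run keeps the token sequence aligned with the original text while ensuring that at most one whitespace token sits between two content tokens.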

The Japanese language class and JapaneseDefaults provide the spaCy language integration, and the tokenizer is registered via @registry.tokenizers("datatrove.ja.JapaneseTokenizer") for use in spaCy pipelines.

Usage

Use this tokenizer for Japanese text processing within datatrove pipelines, particularly for sentence-level operations like sentence deduplication, where memory-safe tokenization is critical for long-running batch processing jobs.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/utils/japanese_tokenizer.py
Lines: 1-310

Signature

@registry.tokenizers("datatrove.ja.JapaneseTokenizer")
def create_tokenizer(split_mode: Optional[str] = None): ...

class JapaneseTokenizer(DummyTokenizer):
    def __init__(self, vocab: Vocab, split_mode: Optional[str] = None) -> None: ...
    def __call__(self, text: str) -> Doc: ...

class Japanese(Language):
    lang = "ja"
    Defaults = JapaneseDefaults

Import

from datatrove.utils.japanese_tokenizer import Japanese, JapaneseTokenizer

I/O Contract

Inputs

Name	Type	Required	Description
vocab	Vocab	Yes	spaCy Vocab object for the language
split_mode	Optional[str]	No	SudachiPy split mode: "A" (default, used when None), "B", or "C"
text	str	Yes (for __call__)	Japanese text to tokenize
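
The split_mode contract in the table above could be sketched like this. The helper names normalize_split_mode and needs_sub_tokens are hypothetical, and raising ValueError for an unknown mode is an assumption made for the sketch; the module itself may react differently to invalid input.

```python
from typing import Optional

VALID_SPLIT_MODES = ("A", "B", "C")


def normalize_split_mode(split_mode: Optional[str]) -> str:
    """Map None to the default mode "A" and reject unknown values."""
    mode = "A" if split_mode is None else split_mode
    if mode not in VALID_SPLIT_MODES:
        raise ValueError(
            f"split_mode must be one of {VALID_SPLIT_MODES}, got {split_mode!r}"
        )
    return mode


def needs_sub_tokens(split_mode: Optional[str]) -> bool:
    """Sub-token details are only recorded for modes B and C."""
    return normalize_split_mode(split_mode) != "A"
```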

Outputs

Name	Type	Description
Doc	spacy.tokens.Doc	spaCy Doc with tokens, POS tags, lemmas, and norms populated

Usage Examples

Basic Usage

from datatrove.utils.japanese_tokenizer import Japanese

# Create a Japanese NLP pipeline with the memory-safe tokenizer
nlp = Japanese()
doc = nlp("東京は日本の首都です。")

for token in doc:
    print(token.text, token.pos_, token.tag_, token.lemma_)

With Split Mode

from datatrove.utils.japanese_tokenizer import Japanese, JapaneseTokenizer

# Split mode B yields middle-length units (coarser than the default mode A)
nlp = Japanese()
nlp.tokenizer = JapaneseTokenizer(nlp.vocab, split_mode="B")
doc = nlp("国際連合教育科学文化機関")

for token in doc:
    print(token.text, token.tag_)

Related Pages

Principle:Huggingface_Datatrove_Japanese_Word_Tokenization
