Principle:Huggingface Datatrove Japanese Word Tokenization

Knowledge Sources	Huggingface_Datatrove
Domains	NLP, Tokenization
Last Updated	2026-02-14 17:00 GMT

Overview

Japanese Word Tokenization is the principle of segmenting Japanese text into words and morphemes using dictionary-based morphological analysis, with memory-safe processing for large-scale batch workloads.

Description

Unlike alphabetic languages where words are separated by spaces, Japanese text is written without explicit word boundaries. Tokenizing Japanese requires morphological analysis, which uses a dictionary of known words and their grammatical properties to find the most likely segmentation of a text string. This is a fundamental preprocessing step for any NLP task involving Japanese text.

The datatrove implementation uses SudachiPy, a modern Japanese morphological analyzer developed by Works Applications, which provides three levels of segmentation granularity (split modes A, B, and C) and integrates with spaCy's NLP pipeline framework. A critical aspect of the implementation is that it works around a memory leak in spaCy's built-in Japanese tokenizer. That leak makes the built-in tokenizer unsafe for long-running batch processing of large text corpora.

Usage

Apply this principle when processing Japanese text in NLP pipelines, particularly for sentence tokenization, word counting, and text normalization operations that require language-aware word boundaries.

Theoretical Basis

Japanese word tokenization in this implementation relies on several key concepts:

Dictionary-Based Morphological Analysis: SudachiPy uses a lattice-based approach with the Sudachi dictionary to find the optimal segmentation of input text. Each possible word in the dictionary is assigned features including part-of-speech, conjugation type, lemma, normalized form, and reading.
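The lattice idea can be illustrated with a toy dynamic program. This is a minimal sketch, not SudachiPy's algorithm: the dictionary TOY_COSTS, its cost values, and the unknown_cost fallback are all hypothetical, and real analyzers also score the connection between adjacent morphemes, which is omitted here.

```python
import math

# Hypothetical toy dictionary: surface form -> word cost (lower is more likely).
TOY_COSTS = {"東京": 3, "東京都": 5, "京都": 3, "東": 6, "都": 4, "に": 1, "住む": 2}

def segment(text, word_costs, unknown_cost=10):
    """Return the minimum-cost segmentation of text over a word lattice."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i] = cheapest cost to segment text[:i]
    back = [0] * (n + 1)         # back[i] = start index of the last word in that path
    best[0] = 0
    max_len = max((len(w) for w in word_costs), default=1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_costs:
                cost = word_costs[piece]
            elif end - start == 1:
                cost = unknown_cost  # single-character fallback for unknown text
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the words.
    words, pos = [], n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]

print(segment("東京都に住む", TOY_COSTS))  # ['東京都', 'に', '住む']
```

With these toy costs, the path 東京都 / に / 住む (total 8) beats 東京 / 都 / に / 住む (total 10), which is why the longer dictionary entry wins.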

Split Modes: SudachiPy provides three granularity levels:
- Mode A: Finest-grained segmentation, and the default in spaCy's Japanese tokenizer (SudachiPy's own default is mode C). Each morpheme is a separate token.
- Mode B: Intermediate segmentation. Some compound words are kept together.
- Mode C: Coarsest segmentation. More compound words and named entities are kept as single tokens.

When using modes B or C, sub-token information (the mode A segmentation) is preserved for downstream use.
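To compare the three modes directly, a sketch along the following lines could be used. It assumes the sudachipy package and one of its dictionaries (for example sudachidict_core) are installed; the helper name split_in_all_modes is hypothetical, and the function returns None when the library or dictionary is unavailable. The exact splits depend on the dictionary version, so no output is shown.

```python
def split_in_all_modes(text):
    """Return {"A": [...], "B": [...], "C": [...]} surfaces, or None if SudachiPy is unusable."""
    try:
        from sudachipy import dictionary, tokenizer
        tok = dictionary.Dictionary().create()
    except Exception:
        # Broad catch on purpose: covers a missing package and a missing dictionary.
        return None
    modes = {
        "A": tokenizer.Tokenizer.SplitMode.A,
        "B": tokenizer.Tokenizer.SplitMode.B,
        "C": tokenizer.Tokenizer.SplitMode.C,
    }
    return {
        name: [m.surface() for m in tok.tokenize(text, mode)]
        for name, mode in modes.items()
    }

result = split_in_all_modes("国家公務員")
if result is not None:
    for name, surfaces in result.items():
        print(name, surfaces)
```

A compound noun such as 国家公務員 is a convenient probe, since compounds are where the modes are most likely to differ.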

Universal Dependencies POS Mapping: Japanese POS tags from the UniDic-based system must be mapped to Universal Dependencies (UD) POS tags. This mapping is context-dependent, and rules are tried in the following order:
- Orthography-based rules: Some tokens have their UD POS determined by their surface form (e.g., specific characters)
- Bigram rules: Some POS tags are resolved based on the tag of the following token
- Unigram fallback: When no context-dependent rule applies, a direct tag-to-POS mapping is used
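The three rule levels above can be sketched as a chain of lookups. The tables below are small illustrative assumptions, not the tokenizer's real mapping tables, and resolve_pos is a simplified stand-in that returns only the POS of the current token.

```python
# Illustrative rule tables (assumed entries, far smaller than a real mapping).
ORTH_RULES = {
    "空白": {" ": "SPACE", "\u3000": "X"},  # POS decided by the surface form
}
BIGRAM_RULES = {
    # POS decided by the tag of the following token; None is the default case.
    "名詞-普通名詞-サ変可能": {"動詞-非自立可能": "VERB", None: "NOUN"},
}
UNIGRAM_RULES = {
    "名詞-普通名詞-一般": "NOUN",
    "助詞-格助詞": "ADP",
    "動詞-非自立可能": "AUX",
}

def resolve_pos(orth, tag, next_tag):
    """Resolve a UD POS: orthography rules, then bigram rules, then unigram fallback."""
    orth_map = ORTH_RULES.get(tag)
    if orth_map and orth in orth_map:
        return orth_map[orth]
    bigram_map = BIGRAM_RULES.get(tag)
    if bigram_map:
        if next_tag in bigram_map:
            return bigram_map[next_tag]
        if None in bigram_map:
            return bigram_map[None]
    return UNIGRAM_RULES.get(tag, "X")  # "X" marks an unmapped tag in this sketch

print(resolve_pos("勉強", "名詞-普通名詞-サ変可能", "動詞-非自立可能"))  # VERB
print(resolve_pos("勉強", "名詞-普通名詞-サ変可能", "助詞-格助詞"))      # NOUN
```

The same noun tag resolves differently depending on what follows it, which is the reason a plain tag-to-POS dictionary is not sufficient.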

Memory Leak Prevention: The spaCy MorphAnalysis object creates a reference cycle that prevents garbage collection in long-running processes. By omitting the token.morph assignment, the tokenizer avoids this leak at the cost of not populating morphological analysis features, which are rarely needed in datatrove's text processing pipelines.
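One way to check whether a tokenizer retains memory across a long batch is to measure traced allocations before and after many calls. The sketch below uses Python's standard tracemalloc module. Both tokenizer functions are hypothetical stand-ins (leaky_tokenize simulates retained per-token state with a module-level list) and do not reproduce spaCy's internals.

```python
import tracemalloc

_RETAINED = []  # hypothetical stand-in for per-token state that is never released

def leaky_tokenize(text):
    """Simulated leak: keeps a feature list for every processed document."""
    tokens = text.split()
    _RETAINED.append([t + "_feat" for t in tokens])
    return tokens

def clean_tokenize(text):
    """Same tokenization, but nothing outlives the call."""
    return text.split()

def memory_growth(fn, n_docs=2000):
    """Return the change in traced memory (bytes) across n_docs calls to fn."""
    tracemalloc.start()
    fn("warm up call")
    before, _peak = tracemalloc.get_traced_memory()
    for i in range(n_docs):
        fn(f"document number {i} with some words")
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before

print("leaky growth:", memory_growth(leaky_tokenize))
print("clean growth:", memory_growth(clean_tokenize))
```

Growth that scales with the number of documents, as in the leaky variant, is the pattern that makes a tokenizer unsuitable for long-running batch jobs.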

Space Token Handling: SudachiPy normalizes text internally and produces individual space character tokens. The tokenizer merges consecutive space tokens and properly tracks whitespace between content tokens for accurate text reconstruction.
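The space handling can be sketched as follows. This is a simplified illustration rather than datatrove's code: the function name merge_space_tokens is hypothetical, and the output follows the spaCy convention of a word list plus a parallel list of flags saying whether each word is followed by a single space.

```python
def merge_space_tokens(surfaces):
    """Merge runs of single-space tokens and track trailing whitespace per word.

    Returns (words, spaces). The first space after a content token becomes that
    token's trailing-space flag; any remaining spaces in the run are merged into
    one whitespace token.
    """
    words, spaces = [], []
    i = 0
    while i < len(surfaces):
        if surfaces[i] == " ":
            j = i
            while j < len(surfaces) and surfaces[j] == " ":
                j += 1
            run = j - i
            if words:
                # Runs are consumed whole, so the previous token is always content.
                spaces[-1] = True
                run -= 1
            if run:
                words.append(" " * run)
                spaces.append(False)
            i = j
        else:
            words.append(surfaces[i])
            spaces.append(False)
            i += 1
    return words, spaces

def reconstruct(words, spaces):
    """Rebuild the original text from words and trailing-space flags."""
    return "".join(w + (" " if s else "") for w, s in zip(words, spaces))

words, spaces = merge_space_tokens(["東京", " ", " ", " ", "大阪", " ", "京都"])
print(words)   # ['東京', '  ', '大阪', '京都']
print(spaces)  # [True, False, True, False]
```

Joining each word with its trailing space reproduces the input exactly, which is the property needed for accurate text reconstruction.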

Related Pages

Implementation:Huggingface_Datatrove_JapaneseTokenizer
