Environment:Marker Inc Korea AutoRAG Japanese NLP Dependencies
| Knowledge Sources | |
|---|---|
| Domains | NLP, Japanese_Language, RAG |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Japanese NLP tokenization environment that provides the SudachiPy morphological analyzer for BM25 lexical retrieval over Japanese documents.
Description
This environment provides the `AutoRAG[ja]` optional extra, which installs the SudachiPy tokenizer together with its core dictionary. SudachiPy is a Japanese morphological analyzer that the BM25 lexical retrieval module uses to tokenize Japanese text accurately. AutoRAG runs it in `SplitMode.A`, the shortest-unit mode, for fine-grained tokenization. Without this extra, Japanese text falls back to space-based or HuggingFace tokenization, which is suboptimal because Japanese does not put spaces between words.
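The fallback problem described above can be seen with a few lines of Python. The following is a minimal sketch, not AutoRAG's own code: `tokenize_by_whitespace` and `tokenize_with_sudachipy` are hypothetical helper names, and the SudachiPy helper assumes `sudachipy` and `sudachidict_core` are installed (it imports them lazily so the whitespace part runs without them).

```python
from typing import List


def tokenize_by_whitespace(texts: List[str]) -> List[List[str]]:
    """Space-based fallback: split each text on runs of whitespace."""
    return [text.split() for text in texts]


def tokenize_with_sudachipy(texts: List[str], dict_name: str = "core") -> List[List[str]]:
    """Hypothetical helper: tokenize with SudachiPy in split mode A.

    Assumes sudachipy and the matching dictionary package are installed.
    """
    from sudachipy import dictionary, tokenizer  # imported lazily on purpose

    tokenizer_obj = dictionary.Dictionary(dict=dict_name).create()
    mode = tokenizer.Tokenizer.SplitMode.A  # shortest units
    return [
        [morpheme.surface() for morpheme in tokenizer_obj.tokenize(text, mode)]
        for text in texts
    ]


# A Japanese sentence contains no spaces, so whitespace splitting
# returns the whole sentence as a single "token".
sentence = "東京都に住んでいます"
whitespace_tokens = tokenize_by_whitespace([sentence])[0]
print(whitespace_tokens)
```

With one token per sentence, BM25 can only match a query that equals the entire sentence, which is why a morphological analyzer is needed for this language.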
Usage
Use this environment when your corpus contains Japanese-language documents and you need BM25 lexical retrieval. It is required for the `sudachipy` tokenizer option in the BM25 module.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux, macOS, or Windows | Cross-platform support |
| Python | >= 3.10 | Same as base environment |
Dependencies
Japanese Extra Packages
- `sudachipy` >= 0.6.8
- `sudachidict_core`
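To confirm that an installed SudachiPy satisfies the `>= 0.6.8` floor listed above, a check along the following lines may help. It is a sketch using only the standard library; `release_tuple`, `meets_minimum`, and `sudachipy_is_recent_enough` are hypothetical names, and the comparison deliberately ignores pre-release suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Tuple


def release_tuple(version_string: str) -> Tuple[int, ...]:
    """Return the leading numeric components, e.g. '0.6.10' -> (0, 6, 10).

    Simplification: suffixes such as 'rc1' or '.post1' are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str = "0.6.8") -> bool:
    """Compare versions numerically rather than as strings."""
    return release_tuple(installed) >= release_tuple(minimum)


def sudachipy_is_recent_enough(minimum: str = "0.6.8") -> bool:
    """False when SudachiPy is missing or older than the minimum."""
    try:
        return meets_minimum(version("sudachipy"), minimum)
    except PackageNotFoundError:
        return False
```

Comparing integer tuples matters here: as plain strings, `"0.6.10"` sorts before `"0.6.8"`, which would wrongly reject a newer release.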
Credentials
No credentials required.
Quick Install
# Install AutoRAG with Japanese support
pip install "AutoRAG[ja]"
# Or install individually
pip install "sudachipy>=0.6.8" sudachidict_core
Code Evidence
SudachiPy import guard from `autorag/nodes/lexicalretrieval/bm25.py:110-117`:
def tokenize_ja_sudachipy(texts: List[str]) -> List[List[str]]:
    try:
        from sudachipy import dictionary, tokenizer
    except ImportError:
        raise ImportError(
            "You need to install SudachiPy to use 'sudachipy' tokenizer. "
            "Please install SudachiPy by running 'pip install sudachipy'."
        )
Split mode selection from `autorag/nodes/lexicalretrieval/bm25.py:123`:
# Choose the tokenizer mode: NORMAL, SEARCH, A
mode = tokenizer.Tokenizer.SplitMode.A
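SudachiPy offers three split modes: A (shortest units), B (middle), and C (longest, named-entity-like units). The sketch below shows one way to compare them side by side; it is illustrative and not taken from AutoRAG. `surfaces_by_mode` and `compare_split_modes` are hypothetical names, and `compare_split_modes` assumes `sudachipy` and `sudachidict_core` are installed.

```python
from typing import Any, Dict, List


def surfaces_by_mode(tokenizer_obj: Any, text: str, modes: Dict[str, Any]) -> Dict[str, List[str]]:
    """Tokenize `text` once per mode and collect the surface forms.

    `tokenizer_obj` only needs a `tokenize(text, mode)` method whose
    results expose `surface()`, which is what SudachiPy provides.
    """
    return {
        name: [morpheme.surface() for morpheme in tokenizer_obj.tokenize(text, mode)]
        for name, mode in modes.items()
    }


def compare_split_modes(text: str) -> Dict[str, List[str]]:
    """Hypothetical helper: run SudachiPy in modes A, B, and C on one text."""
    from sudachipy import dictionary, tokenizer  # imported lazily on purpose

    split_mode = tokenizer.Tokenizer.SplitMode
    tokenizer_obj = dictionary.Dictionary().create()
    return surfaces_by_mode(
        tokenizer_obj,
        text,
        {"A": split_mode.A, "B": split_mode.B, "C": split_mode.C},
    )
```

The SudachiPy README uses 国家公務員 as its example: mode C keeps it as one token, mode B splits it into 国家 and 公務員, and mode A splits it into 国家, 公務, and 員. Calling `compare_split_modes("国家公務員")` should show the same pattern, though exact output depends on the dictionary version.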
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: You need to install SudachiPy` | SudachiPy not installed | `pip install "AutoRAG[ja]"` or `pip install sudachipy` |
| `Dictionary not found` | sudachidict_core not installed | `pip install sudachidict_core` |
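Both errors in the table come down to a missing package, so a quick preflight check can tell them apart before a pipeline run. This is a sketch using the standard library only; `missing_modules` and `diagnose_japanese_extra` are hypothetical names rather than AutoRAG functions.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be imported."""
    return [name for name in module_names if find_spec(name) is None]


def diagnose_japanese_extra() -> str:
    """Report which of the two Japanese-extra packages are absent."""
    missing = missing_modules(["sudachipy", "sudachidict_core"])
    if not missing:
        return "ok"
    return "missing: " + ", ".join(missing)


print(diagnose_japanese_extra())
```

If the report lists only `sudachidict_core`, the tokenizer itself is present and only the dictionary needs to be installed.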
Compatibility Notes
- Split Mode A: AutoRAG uses the finest granularity (Mode A). This produces more tokens per document, which is beneficial for BM25 recall but increases index size.
- Dictionary: Uses the `core` dictionary by default. For specialized domains, `sudachidict_full` may improve tokenization quality; it is a separate package (`pip install sudachidict_full`) and is not installed by the `ja` extra.
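The recall effect of fine-grained tokens can be illustrated with a small BM25 scorer. The sketch below is a textbook BM25 variant with a non-negative IDF, not AutoRAG's implementation, and the token lists are hand-written to mimic fine and coarse segmentation rather than actual SudachiPy output. `bm25_scores` is a hypothetical name, and `k1` and `b` are set to commonly used defaults.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query_tokens: List[str], corpus_tokens: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against a tokenized query."""
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # only exact token matches contribute
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Hand-written token lists (assumed, for illustration only).
fine_corpus = [["国家", "公務", "員"], ["東京", "都"]]
coarse_corpus = [["国家公務員"], ["東京都"]]
query = ["公務"]

fine_scores = bm25_scores(query, fine_corpus)
coarse_scores = bm25_scores(query, coarse_corpus)
print(fine_scores, coarse_scores)
```

The query term matches the first document only when the compound is split into short units; with coarse tokens, no document scores above zero. The cost is visible too: the fine-grained corpus stores five tokens where the coarse one stores two.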