Environment:Marker Inc Korea AutoRAG Japanese NLP Dependencies

Knowledge Sources	AutoRAG pyproject.toml ja extra
Domains	NLP, Japanese_Language, RAG
Last Updated	2026-02-12 00:00 GMT

Overview

Japanese NLP tokenization environment that provides the SudachiPy morphological analyzer for BM25 lexical retrieval over Japanese documents.

Description

This environment provides the `AutoRAG[ja]` optional extra, which installs the SudachiPy tokenizer together with its core dictionary (`sudachidict_core`). SudachiPy is a Japanese morphological analyzer that the BM25 lexical retrieval module uses to tokenize Japanese text accurately. AutoRAG runs it in `SplitMode.A`, which splits text into the shortest units for fine-grained tokenization. Without this extra, Japanese text is left to space-based or HuggingFace tokenization, both of which are suboptimal because Japanese does not put spaces between words.
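
To make the limitation of space-based tokenization concrete, the following minimal sketch uses only the Python standard library. The sample sentences are illustrative and do not come from AutoRAG.

```python
# Whitespace splitting works for English but not for unsegmented Japanese.
english = "the weather is nice today"
japanese = "今日は良い天気です"  # roughly: "The weather is nice today"

english_tokens = english.split()
japanese_tokens = japanese.split()

print(english_tokens)   # ['the', 'weather', 'is', 'nice', 'today']
print(japanese_tokens)  # ['今日は良い天気です']
```

Because the whole Japanese sentence becomes a single token, a BM25 index built this way can only match queries that repeat the sentence exactly, which is why a morphological analyzer is needed.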

Usage

Use this environment when your corpus contains Japanese-language documents and you need BM25 lexical retrieval. It is required for the `sudachipy` tokenizer option in the BM25 module.

System Requirements

Category	Requirement	Notes
OS	Linux, macOS, or Windows	Cross-platform support
Python	>= 3.10	Same as base environment

Dependencies

Japanese Extra Packages

`sudachipy` >= 0.6.8
`sudachidict_core`

Credentials

No credentials required.

Quick Install

# Install AutoRAG with Japanese support
pip install "AutoRAG[ja]"

# Or install individually
# (quote the version specifier so the shell does not treat ">" as a redirect)
pip install "sudachipy>=0.6.8" sudachidict_core

Code Evidence

SudachiPy import guard from `autorag/nodes/lexicalretrieval/bm25.py:110-117`:

def tokenize_ja_sudachipy(texts: List[str]) -> List[List[str]]:
    try:
        from sudachipy import dictionary, tokenizer
    except ImportError:
        raise ImportError(
            "You need to install SudachiPy to use 'sudachipy' tokenizer. "
            "Please install SudachiPy by running 'pip install sudachipy'."
        )

Split mode selection from `autorag/nodes/lexicalretrieval/bm25.py:123`:

# Choose the tokenizer mode: NORMAL, SEARCH, A
mode = tokenizer.Tokenizer.SplitMode.A
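
The two excerpts above can be combined into a standalone sketch of Mode A tokenization. This is not AutoRAG's exact implementation: the function name `tokenize_ja` and the error message are assumptions made for illustration, and only the SudachiPy calls (`dictionary.Dictionary().create()`, `Tokenizer.SplitMode.A`, `tokenize`, `surface`) are taken from the library itself.

```python
from typing import List


def tokenize_ja(texts: List[str]) -> List[List[str]]:
    """Split each Japanese string into Mode A surface forms with SudachiPy.

    Hypothetical helper for illustration; not part of AutoRAG's public API.
    """
    try:
        from sudachipy import dictionary, tokenizer
    except ImportError as exc:
        raise ImportError(
            "SudachiPy is required; install it with 'pip install \"AutoRAG[ja]\"'."
        ) from exc

    # Dictionary() loads sudachidict_core by default; create() builds the tokenizer.
    tokenizer_obj = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.A  # shortest units

    return [
        [morpheme.surface() for morpheme in tokenizer_obj.tokenize(text, mode)]
        for text in texts
    ]
```

Calling `tokenize_ja(["国家公務員"])` returns one token list per input string. The exact split depends on the installed dictionary version, so no expected output is shown here.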

Common Errors

Error Message	Cause	Solution
`ImportError: You need to install SudachiPy`	SudachiPy not installed	`pip install "AutoRAG[ja]"` or `pip install sudachipy`
`Dictionary not found`	sudachidict_core not installed	`pip install sudachidict_core`
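
Both errors in the table come down to a missing package, so a quick pre-flight check can catch them before a retrieval run starts. The sketch below assumes the import names match the package names listed under Dependencies. `check_japanese_extra` is a hypothetical helper, not an AutoRAG function.

```python
from importlib.util import find_spec
from typing import Dict


def check_japanese_extra() -> Dict[str, bool]:
    """Report whether each package from the ja extra is importable."""
    modules = ("sudachipy", "sudachidict_core")
    # find_spec returns None when a top-level module cannot be located.
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, installed in check_japanese_extra().items():
        status = "ok" if installed else f"missing, try: pip install {name}"
        print(f"{name}: {status}")
```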

Compatibility Notes

Split Mode A: AutoRAG uses the finest granularity (Mode A). This produces more tokens per document, which is beneficial for BM25 recall but increases index size.
Dictionary: Uses the `core` dictionary by default. For specialized domains, `sudachidict_full` may improve tokenization quality.

Related Pages

Implementation:Marker_Inc_Korea_AutoRAG_Dontknow_Filter_Rule_Based
