Environment:Marker Inc Korea AutoRAG Japanese NLP Dependencies

From Leeroopedia
Knowledge Sources
Domains NLP, Japanese_Language, RAG
Last Updated 2026-02-12 00:00 GMT

Overview

Japanese NLP tokenization environment that provides the SudachiPy morphological analyzer for BM25 lexical retrieval over Japanese documents.

Description

This environment provides the `AutoRAG[ja]` optional extra, which installs the SudachiPy tokenizer together with its core dictionary (`sudachidict_core`). SudachiPy is a Japanese morphological analyzer that the BM25 lexical retrieval module uses to tokenize Japanese text. AutoRAG runs it in `SplitMode.A`, the shortest-unit mode, to get fine-grained tokens. Without this extra, Japanese text falls back to space-based or HuggingFace tokenization, both of which are poorly suited to Japanese because the language does not separate words with spaces.

Usage

Use this environment when your corpus contains Japanese-language documents and you need BM25 lexical retrieval. It is required for the `sudachipy` tokenizer option in the BM25 module.
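To illustrate why token granularity matters for BM25 on Japanese text, here is a minimal sketch of a BM25 scorer in pure Python. It is not AutoRAG's implementation; the function name `bm25_scores`, the parameter defaults, and the hand-written token lists are all assumptions made for the example. Real tokens would come from the tokenizer.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score each pre-tokenized document against a pre-tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # term absent from this document
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hand-written tokens for illustration only (hypothetical segmentation).
# Whitespace splitting leaves each unspaced Japanese string as one token.
unsegmented_docs = [["国家公務員"], ["東京都"]]
# A fine-grained segmentation exposes the shared sub-word units.
segmented_docs = [["国家", "公務", "員"], ["東京", "都"]]

coarse = bm25_scores(["公務員"], unsegmented_docs)
fine = bm25_scores(["公務", "員"], segmented_docs)
```

With the unsegmented tokens, the query never matches because no document token equals it exactly. With the segmented tokens, the first document shares terms with the query and receives a positive score.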

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux, macOS, or Windows | Cross-platform support |
| Python | >= 3.10 | Same as base environment |

Dependencies

Japanese Extra Packages

  • `sudachipy` >= 0.6.8
  • `sudachidict_core`

Credentials

No credentials required.

Quick Install

# Install AutoRAG with Japanese support
pip install "AutoRAG[ja]"

# Or install individually
# (quote the version specifier so the shell does not treat ">" as a redirect)
pip install "sudachipy>=0.6.8" sudachidict_core
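After installing, a short check like the following can confirm that both packages are importable. This is a sketch rather than part of AutoRAG; the helper name `installed_versions` is hypothetical, and the version lookup reports `"unknown"` when the distribution metadata cannot be found under the module name.

```python
import importlib.util
from importlib import metadata
from typing import Dict, Optional, Sequence

# Module names the Japanese extra is expected to make importable.
REQUIRED_MODULES = ("sudachipy", "sudachidict_core")


def installed_versions(
    modules: Sequence[str] = REQUIRED_MODULES,
) -> Dict[str, Optional[str]]:
    """Map each module name to its version string, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in modules:
        if importlib.util.find_spec(name) is None:
            versions[name] = None  # not importable at all
            continue
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this name.
            versions[name] = "unknown"
    return versions


report = installed_versions()
for module_name, version in report.items():
    print(f"{module_name}: {version or 'MISSING'}")
```

Any module reported as `MISSING` needs to be installed before the `sudachipy` tokenizer option can work.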

Code Evidence

SudachiPy import guard from `autorag/nodes/lexicalretrieval/bm25.py:110-117`:

def tokenize_ja_sudachipy(texts: List[str]) -> List[List[str]]:
    try:
        from sudachipy import dictionary, tokenizer
    except ImportError:
        raise ImportError(
            "You need to install SudachiPy to use 'sudachipy' tokenizer. "
            "Please install SudachiPy by running 'pip install sudachipy'."
        )

Split mode selection from `autorag/nodes/lexicalretrieval/bm25.py:123`:

# Choose the tokenizer mode: NORMAL, SEARCH, A
mode = tokenizer.Tokenizer.SplitMode.A
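SudachiPy itself defines three split modes: A (shortest units), B (middle), and C (longest, named-entity level). The sketch below compares them on one input. The helper name `split_in_all_modes` is hypothetical, and returning surface forms is an assumption, since the excerpt above does not show which token attribute AutoRAG keeps.

```python
from typing import Dict, List


def split_in_all_modes(text: str) -> Dict[str, List[str]]:
    """Tokenize `text` in split modes A, B, and C and return surface forms."""
    # Imported lazily so this module loads even without the optional extra.
    from sudachipy import dictionary, tokenizer

    # With no arguments, SudachiPy uses its default (core) dictionary.
    tokenizer_obj = dictionary.Dictionary().create()
    modes = {
        "A": tokenizer.Tokenizer.SplitMode.A,
        "B": tokenizer.Tokenizer.SplitMode.B,
        "C": tokenizer.Tokenizer.SplitMode.C,
    }
    return {
        name: [m.surface() for m in tokenizer_obj.tokenize(text, mode)]
        for name, mode in modes.items()
    }
```

For example, the SudachiPy README reports that 国家公務員 splits into 国家 / 公務 / 員 in mode A, 国家 / 公務員 in mode B, and stays whole in mode C. Exact results can vary with the dictionary version.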

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `ImportError: You need to install SudachiPy` | SudachiPy not installed | `pip install "AutoRAG[ja]"` or `pip install sudachipy` |
| `Dictionary not found` | sudachidict_core not installed | `pip install sudachidict_core` |

Compatibility Notes

  • Split Mode A: AutoRAG uses the finest granularity (Mode A). This produces more tokens per document, which is beneficial for BM25 recall but increases index size.
  • Dictionary: Uses the `core` dictionary by default. For specialized domains, `sudachidict_full` may improve tokenization quality.
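For standalone experiments with a different dictionary, SudachiPy accepts a `dict` argument when building the dictionary. The sketch below prefers the full dictionary when `sudachidict_full` is installed and otherwise falls back to core. The helper names are hypothetical, and the source does not say whether AutoRAG exposes a dictionary option, so this applies to direct SudachiPy use only.

```python
import importlib.util


def pick_sudachi_dict(preferred: str = "full") -> str:
    """Return `preferred` if its dictionary package is importable, else "core"."""
    # SudachiDict packages are importable as sudachidict_small/core/full.
    if importlib.util.find_spec(f"sudachidict_{preferred}") is not None:
        return preferred
    return "core"


def make_tokenizer(preferred: str = "full"):
    """Build a SudachiPy tokenizer backed by the chosen dictionary."""
    # Imported lazily so this module loads even without SudachiPy installed.
    from sudachipy import dictionary

    return dictionary.Dictionary(dict=pick_sudachi_dict(preferred)).create()
```

Falling back to core keeps the code usable in environments that only installed `AutoRAG[ja]`, at the cost of silently using the smaller vocabulary.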
