Environment:Marker Inc Korea AutoRAG Korean NLP Dependencies

Knowledge Sources	AutoRAG pyproject.toml ko extra
Domains	NLP, Korean_Language, RAG
Last Updated	2026-02-12 00:00 GMT

Overview

Korean NLP tokenization environment providing Kiwi and KoNLPy morphological analyzers for BM25 lexical retrieval and text splitting on Korean documents.

Description

This environment provides the `AutoRAG[ko]` optional extra, which installs two Korean-language NLP packages: kiwipiepy (the Kiwi morphological analyzer) for high-quality Korean tokenization, and konlpy (the KoNLPy toolkit), which provides the Kkma and Okt tokenizers. The BM25 lexical retrieval module and the Korean sentence splitter in the data parsing pipeline use these tokenizers. Without this extra, the Korean tokenizer options raise `ImportError`, and the remaining choices such as space-based tokenization produce poor BM25 retrieval quality on Korean text.
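To illustrate why space-based tokenization hurts lexical matching on Korean, here is a minimal sketch that uses only the standard library. The query and document strings are made-up examples. Korean particles attach directly to nouns, so a whitespace split yields tokens that never match the bare noun in a query.

```python
def whitespace_tokens(text: str) -> set[str]:
    """Split on whitespace only, as a space-based tokenizer would."""
    return set(text.split())


# Hypothetical query and document, chosen for illustration only.
query = "검색 품질"
document = "검색의 품질을 높인다"

# "검색" != "검색의" and "품질" != "품질을", so nothing overlaps.
shared = whitespace_tokens(query) & whitespace_tokens(document)
print(shared)  # set()
```

A morphological analyzer is expected to separate the particles from the nouns, which is what lets BM25 match the query terms against the document.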

Usage

Use this environment when your corpus contains Korean-language documents and you need accurate BM25 lexical retrieval or Korean sentence splitting. Required for the `ko_kiwi`, `ko_kkma`, and `ko_okt` tokenizer options in the BM25 retrieval module.
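As a rough way to see which of these tokenizer options are usable in the current interpreter, the sketch below checks whether the backing packages can be found. The mapping and the function name `usable_korean_tokenizers` are assumptions made for illustration and are not part of AutoRAG.

```python
from importlib.util import find_spec

# Assumed mapping: BM25 tokenizer option -> top-level package it imports.
REQUIRED_PACKAGE = {
    "ko_kiwi": "kiwipiepy",
    "ko_kkma": "konlpy",
    "ko_okt": "konlpy",
}


def usable_korean_tokenizers() -> list[str]:
    """Return the tokenizer options whose backing package is importable."""
    return [
        name
        for name, package in REQUIRED_PACKAGE.items()
        if find_spec(package) is not None
    ]


print(usable_korean_tokenizers())
```

This only confirms that the package is installed. It does not verify that Java is available for the KoNLPy-backed options.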

System Requirements

Category	Requirement	Notes
OS	Linux or macOS	KoNLPy additionally needs a Java installation (see the Java row)
Java	JDK 8+ (for KoNLPy)	Required by Kkma and Okt backends
Python	>= 3.10	Same as base environment

Dependencies

Korean Extra Packages

`kiwipiepy` >= 0.18.0
`konlpy` >= 0.6.0

Credentials

No credentials required.

Quick Install

# Install AutoRAG with Korean support
pip install "AutoRAG[ko]"

# Or install individually
# (quote the specifiers so the shell does not treat ">" as a redirect)
pip install "kiwipiepy>=0.18.0" "konlpy>=0.6.0"
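After installing, the versions can be compared against the minimums listed under Dependencies. This is a minimal sketch using `importlib.metadata`. The helper names `parse_version` and `report` are made up for this example, and the version parsing is deliberately naive (it reads only the leading numeric components).

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

# Minimum versions taken from the Dependencies section.
MINIMUMS = {"kiwipiepy": (0, 18, 0), "konlpy": (0, 6, 0)}


def parse_version(text: str) -> tuple[int, ...]:
    """Naively parse up to three leading numeric components, zero-padded."""
    parts: list[int] = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def report() -> dict[str, str]:
    """Map each package to 'ok', 'missing', or a 'too old' note."""
    result: dict[str, str] = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            result[name] = "missing"
            continue
        ok = parse_version(installed) >= minimum
        result[name] = "ok" if ok else f"too old ({installed})"
    return result


print(report())
```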

Code Evidence

Kiwi tokenizer import guard from `autorag/nodes/lexicalretrieval/bm25.py:29-36`:

def tokenize_ko_kiwi(texts: List[str]) -> List[List[str]]:
    try:
        from kiwipiepy import Kiwi, Token
    except ImportError:
        raise ImportError(
            "You need to install kiwipiepy to use 'ko_kiwi' tokenizer. "
            "Please install kiwipiepy by running 'pip install kiwipiepy'. "
            "Or install Korean version of AutoRAG by running 'pip install AutoRAG[ko]'."
        )

Kkma tokenizer import guard from `autorag/nodes/lexicalretrieval/bm25.py:54-61`:

def tokenize_ko_kkma(texts: List[str]) -> List[List[str]]:
    try:
        from konlpy.tag import Kkma
    except ImportError:
        raise ImportError(
            "You need to install konlpy to use 'ko_kkma' tokenizer. "
            "Please install konlpy by running 'pip install konlpy'. "
            "Or install Korean version of AutoRAG by running 'pip install AutoRAG[ko]'."
        )

Korean sentence splitter from `autorag/data/__init__.py:90-97`:

try:
    from kiwipiepy import Kiwi
except ImportError:
    raise ImportError(
        "You need to install kiwipiepy to use 'ko_kiwi' tokenizer. "
        "Please install kiwipiepy by running 'pip install kiwipiepy'. "
        "Or install Korean version of AutoRAG by running 'pip install AutoRAG[ko]'."
    )
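The three excerpts above share one pattern: import the optional package lazily and re-raise `ImportError` with an install hint. A generic version of that pattern might look like the sketch below. The `require` helper is hypothetical and does not exist in AutoRAG. It only mirrors the message format shown in the excerpts.

```python
import importlib
from types import ModuleType


def require(package: str, tokenizer: str) -> ModuleType:
    """Import an optional package, or raise ImportError with an install hint."""
    try:
        return importlib.import_module(package)
    except ImportError as exc:
        raise ImportError(
            f"You need to install {package} to use '{tokenizer}' tokenizer. "
            f"Please install {package} by running 'pip install {package}'. "
            "Or install Korean version of AutoRAG by running "
            "'pip install AutoRAG[ko]'."
        ) from exc


# Usage: the import only happens when the tokenizer is actually requested.
try:
    require("package_that_is_not_installed_xyz", "ko_demo")
except ImportError as err:
    print(err)
```

Deferring the import to call time means that users who never select a Korean tokenizer do not need the extra installed.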

Common Errors

Error Message	Cause	Solution
`ImportError: You need to install kiwipiepy`	kiwipiepy not installed	`pip install "AutoRAG[ko]"` or `pip install kiwipiepy`
`ImportError: You need to install konlpy`	konlpy not installed	`pip install "AutoRAG[ko]"` or `pip install konlpy`
`UnicodeDecodeError` in Kiwi tokenization	Malformed Unicode in input	Handled internally by `extract_form_safe()` which returns a space on error
Java-related errors from KoNLPy	JDK not installed	Install JDK 8+ (e.g., `apt install default-jdk` on Ubuntu)

Compatibility Notes

Kiwi vs KoNLPy: Kiwi (`ko_kiwi`) has no Java dependency; kiwipiepy is a Python binding to the C++ Kiwi library. KoNLPy (`ko_kkma`, `ko_okt`) requires a Java installation.
Fallback: If the Korean tokenizers are not installed, you can still use `porter_stemmer`, `space`, or HuggingFace tokenizers for BM25, but retrieval quality on Korean text will be degraded.
Unicode safety: The Kiwi tokenizer includes a `UnicodeDecodeError` handler that replaces problematic tokens with spaces rather than crashing.
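The Unicode-safety behavior described above can be sketched as follows. This is not the AutoRAG implementation of `extract_form_safe()`. It is an illustrative stand-in, assuming a token object whose `form` attribute may raise `UnicodeDecodeError`. `GoodToken`, `BrokenToken`, and `extract_form_safe_sketch` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GoodToken:
    """Stand-in for a token whose surface form decodes cleanly."""
    form: str


class BrokenToken:
    """Stand-in for a token whose surface form cannot be decoded."""

    @property
    def form(self) -> str:
        raise UnicodeDecodeError("utf-8", b"\xff", 0, 1, "invalid start byte")


def extract_form_safe_sketch(token) -> str:
    """Return the token's form, or a single space if decoding fails."""
    try:
        return token.form
    except UnicodeDecodeError:
        return " "


tokens = (GoodToken("검색"), BrokenToken(), GoodToken("품질"))
forms = [extract_form_safe_sketch(t) for t in tokens]
print(forms)  # ['검색', ' ', '품질']
```

Substituting a space keeps the token list aligned with the input while letting the rest of the document be indexed.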

Related Pages

Implementation:Marker_Inc_Korea_AutoRAG_Dontknow_Filter_Rule_Based
