Environment:Marker Inc Korea AutoRAG Korean NLP Dependencies
| Knowledge Sources | |
|---|---|
| Domains | NLP, Korean_Language, RAG |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Korean NLP tokenization environment providing Kiwi and KoNLPy morphological analyzers for BM25 lexical retrieval and text splitting on Korean documents.
Description
This environment provides the `AutoRAG[ko]` optional extra, which installs Korean-language NLP tokenizers: kiwipiepy (the Kiwi morphological analyzer) for high-quality Korean tokenization, and konlpy (a Korean NLP toolkit), which provides the Kkma and Okt tokenizers. These tokenizers are used by the BM25 lexical retrieval module and by the Korean sentence splitter in the data parsing pipeline. Without this extra, the Korean tokenizer options raise an `ImportError`, and Korean text has to be tokenized with a non-Korean option such as space-based splitting, which produces poor BM25 retrieval quality.
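The weakness of space-based tokenization on Korean can be illustrated in plain Python, with no Korean tokenizer installed. The sketch below uses an illustrative sentence and query (not taken from AutoRAG) to show that a bare noun in a query does not match the same noun with a particle attached, which is the exact-token overlap a BM25 index relies on.

```python
# Korean particles attach directly to nouns, so whitespace tokens
# rarely equal the bare query term.
doc = "학교에 갔다"   # "went to school": 학교 (school) + 에 (to)
query = "학교"        # bare noun "school"

doc_tokens = doc.split()      # whitespace tokenization
query_tokens = query.split()

# BM25 scores documents by exact token overlap with the query.
overlap = set(doc_tokens) & set(query_tokens)

print(doc_tokens)  # ['학교에', '갔다']
print(overlap)     # set()
```

A morphological analyzer would typically separate 학교 from the particle 에, so the query term and the document term line up.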
Usage
Use this environment when your corpus contains Korean-language documents and you need accurate BM25 lexical retrieval or Korean sentence splitting. It is required for the `ko_kiwi`, `ko_kkma`, and `ko_okt` tokenizer options in the BM25 retrieval module.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux or macOS | KoNLPy also needs a Java installation on either platform (see the Java row) |
| Java | JDK 8+ (for KoNLPy) | Required by Kkma and Okt backends |
| Python | >= 3.10 | Same as base environment |
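The requirements above can be checked before installing anything. The following is a minimal preflight sketch using only the standard library; the function name `check_korean_env` is hypothetical, and looking for `java` on `PATH` is only a heuristic, since a Java installation may also be located through `JAVA_HOME`.

```python
import shutil
import sys


def check_korean_env(min_python=(3, 10)):
    """Return a dict describing whether the documented requirements look met."""
    return {
        # Python >= 3.10, as listed in the requirements table.
        "python_ok": sys.version_info >= min_python,
        # Heuristic only: Java may be installed without being on PATH.
        "java_on_path": shutil.which("java") is not None,
    }


report = check_korean_env()
print(report)
```

If `java_on_path` is false and only `ko_kiwi` is needed, the missing Java installation is not a problem, because only the KoNLPy backends depend on it.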
Dependencies
Korean Extra Packages
- `kiwipiepy` >= 0.18.0
- `konlpy` >= 0.6.0
Credentials
No credentials required.
Quick Install
# Install AutoRAG with Korean support
pip install "AutoRAG[ko]"
# Or install individually (quote the specifiers so the shell does not treat ">" as a redirect)
pip install "kiwipiepy>=0.18.0" "konlpy>=0.6.0"
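After installation, the installed versions can be confirmed from Python. This is a small sketch built on the standard-library `importlib.metadata` module; `installed_version` is a hypothetical helper name, not part of AutoRAG.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("kiwipiepy", "konlpy"):
    print(name, installed_version(name) or "not installed")
```

Compare the printed versions against the minimums listed under Dependencies.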
Code Evidence
Kiwi tokenizer import guard from `autorag/nodes/lexicalretrieval/bm25.py:29-36`:
def tokenize_ko_kiwi(texts: List[str]) -> List[List[str]]:
    try:
        from kiwipiepy import Kiwi, Token
    except ImportError:
        raise ImportError(
            "You need to install kiwipiepy to use 'ko_kiwi' tokenizer. "
            "Please install kiwipiepy by running 'pip install kiwipiepy'. "
            "Or install Korean version of AutoRAG by running 'pip install AutoRAG[ko]'."
        )
Kkma tokenizer import guard from `autorag/nodes/lexicalretrieval/bm25.py:54-61`:
def tokenize_ko_kkma(texts: List[str]) -> List[List[str]]:
    try:
        from konlpy.tag import Kkma
    except ImportError:
        raise ImportError(
            "You need to install konlpy to use 'ko_kkma' tokenizer. "
            "Please install konlpy by running 'pip install konlpy'. "
            "Or install Korean version of AutoRAG by running 'pip install AutoRAG[ko]'."
        )
Korean sentence splitter from `autorag/data/__init__.py:90-97`:
try:
    from kiwipiepy import Kiwi
except ImportError:
    raise ImportError(
        "You need to install kiwipiepy to use 'ko_kiwi' tokenizer. "
        "Please install kiwipiepy by running 'pip install kiwipiepy'. "
        "Or install Korean version of AutoRAG by running 'pip install AutoRAG[ko]'."
    )
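The three excerpts above share one pattern: import the optional dependency lazily, inside the code path that needs it, and convert a missing package into an actionable message. A generic sketch of that pattern follows. The helper `require` and its parameters are hypothetical; the excerpts inline this logic rather than calling a shared function.

```python
import importlib


def require(module_name, tokenizer, pip_name=None):
    """Import module_name lazily, or raise ImportError with install advice.

    Hypothetical helper for illustration; not an AutoRAG function.
    """
    pip_name = pip_name or module_name
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"You need to install {pip_name} to use '{tokenizer}' tokenizer. "
            f"Run 'pip install {pip_name}' or 'pip install \"AutoRAG[ko]\"'."
        ) from exc


# A module that exists is returned as usual.
json_module = require("json", "demo")
print(json_module.dumps([1]))
```

Deferring the import this way keeps the base package importable for users who never select a Korean tokenizer.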
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: You need to install kiwipiepy` | kiwipiepy not installed | `pip install "AutoRAG[ko]"` or `pip install kiwipiepy` |
| `ImportError: You need to install konlpy` | konlpy not installed | `pip install "AutoRAG[ko]"` or `pip install konlpy` |
| `UnicodeDecodeError` in Kiwi tokenization | Malformed Unicode in input | Handled internally by `extract_form_safe()`, which returns a space on error |
| Java-related errors from KoNLPy | JDK not installed | Install JDK 8+ (e.g., `apt install default-jdk` on Ubuntu) |
Compatibility Notes
- Kiwi vs KoNLPy: Kiwi (`ko_kiwi`) is a C++ library used through the kiwipiepy Python bindings and has no Java dependency. KoNLPy (`ko_kkma`, `ko_okt`) requires a Java installation (JDK 8+).
- Fallback: If the Korean tokenizers are not installed, you can still use `porter_stemmer`, `space`, or HuggingFace tokenizers for BM25, but retrieval quality on Korean text will be degraded.
- Unicode safety: The Kiwi tokenizer includes a `UnicodeDecodeError` handler that replaces problematic tokens with spaces rather than crashing.
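The fallback note above can be turned into a small selection routine that prefers a Korean tokenizer when its package is importable and otherwise drops to `space`. This is a sketch under stated assumptions: `REQUIRED_MODULE` and `pick_tokenizer` are hypothetical names, the preference order is arbitrary, and the check only confirms that the Python package is importable. It does not verify that Java is available for the KoNLPy backends.

```python
from importlib.util import find_spec

# Tokenizer option -> module it needs (None means no Korean extra is needed).
REQUIRED_MODULE = {
    "ko_kiwi": "kiwipiepy",
    "ko_kkma": "konlpy",
    "ko_okt": "konlpy",
    "space": None,
}


def pick_tokenizer(preferred=("ko_kiwi", "ko_okt", "ko_kkma", "space")):
    """Return the first tokenizer option whose backing module is importable."""
    for name in preferred:
        module = REQUIRED_MODULE[name]
        if module is None or find_spec(module) is not None:
            return name
    raise RuntimeError("no usable tokenizer in the preference list")


print(pick_tokenizer())
```

Listing `space` last makes the degraded path explicit, so a run on Korean text without the extra installed is a visible choice, not a silent one.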