Principle:Infiniflow Ragflow Keyword Extraction
| Knowledge Sources | |
|---|---|
| Domains | RAG, NLP, Information_Retrieval |
| Last Updated | 2026-02-12 06:00 GMT |
Overview
A text analysis technique that tokenizes chunk content for BM25 keyword search and, optionally, extracts important keywords with an LLM.
Description
Keyword extraction and tokenization prepare document chunks for sparse (keyword-based) retrieval. The RAGFlow tokenizer handles bilingual Chinese-English text with custom segmentation rules. Tokenized content is stored as content_ltks (space-separated tokens) for BM25 matching. Optionally, controlled by the auto_keywords config, an LLM extracts the most important keywords, which are stored as important_kwd for boosted matching.
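The field layout described above can be illustrated with a minimal sketch. It is not RAGFlow's tokenizer: `simple_tokenize`, `build_sparse_fields`, and the `keyword_extractor` callable are hypothetical names, and the regex-based splitting (Latin/digit runs plus single CJK characters) is a crude stand-in for real Chinese word segmentation. Only the field names content_ltks and important_kwd come from the description.

```python
import re
from typing import Callable, Dict, List, Optional

# Latin/digit runs, or single CJK characters.
# Assumption: a crude stand-in for the custom segmentation rules, not the real tokenizer.
_TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[\u4e00-\u9fff]")


def simple_tokenize(text: str) -> List[str]:
    """Split mixed Chinese-English text into lowercase tokens (hypothetical helper)."""
    return [tok.lower() for tok in _TOKEN_RE.findall(text)]


def build_sparse_fields(
    content: str,
    keyword_extractor: Optional[Callable[[str, int], List[str]]] = None,
    top_n: int = 5,
) -> Dict[str, object]:
    """Build the sparse-retrieval fields for one chunk.

    keyword_extractor is a hypothetical callable standing in for the
    optional LLM step; when it is None, important_kwd is omitted.
    """
    fields: Dict[str, object] = {
        # Space-separated tokens, as described for content_ltks.
        "content_ltks": " ".join(simple_tokenize(content)),
    }
    if keyword_extractor is not None:
        fields["important_kwd"] = keyword_extractor(content, top_n)
    return fields
```

A caller would pass the chunk text and, if keyword extraction is wanted, any function that maps `(text, top_n)` to a list of keywords, for example a thin wrapper around a chat-model call.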
Usage
This step runs automatically during document processing, after embedding generation. Tokenizer behavior depends on the document store engine: with Elasticsearch the custom tokenization is applied, while with Infinity the text is returned unchanged.
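The engine-dependent behavior can be pictured as a simple dispatch. The sketch below is illustrative only: `tokenize_for_engine` and the `doc_engine` parameter are assumed names, and the `\w+` split merely stands in for the custom tokenization.

```python
import re


def tokenize_for_engine(text: str, doc_engine: str) -> str:
    """Return the string to store for keyword search (hypothetical helper).

    doc_engine is an assumed parameter naming the document store engine.
    """
    if doc_engine.lower() == "infinity":
        # Per the description, Infinity receives the text unchanged.
        return text
    # Otherwise apply tokenization in the application.
    # Assumption: a lowercase \w+ split is only a placeholder for the real rules.
    return " ".join(re.findall(r"\w+", text.lower()))
```

For example, `tokenize_for_engine("Hello, World", "elasticsearch")` yields a space-separated token string, while the same call with `"infinity"` passes the input through untouched.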
Theoretical Basis
Keyword search complements vector search:
- BM25/TF-IDF: Statistical term-frequency models that excel at exact-match retrieval
- Tokenization: Chinese text requires word segmentation; English requires stemming and stop-word handling
- LLM keyword extraction: A chat model identifies the most important terms, giving a semantic selection of keywords
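To make the BM25 point concrete, here is a small, self-contained scoring sketch over pre-tokenized documents. It uses the common Okapi BM25 formula with the `ln(1 + ...)` IDF variant and default parameters `k1 = 1.2`, `b = 0.75`; these are textbook choices, not settings confirmed for RAGFlow or its document store, and `bm25_scores` is a hypothetical name.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str],
    docs: List[List[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> List[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because scoring depends only on exact token overlap, a document that shares no query token scores zero regardless of meaning, which is the gap vector search fills.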