Principle:Infiniflow Ragflow Keyword Extraction

Knowledge Sources	RAGFlow
Domains	RAG, NLP, Information_Retrieval
Last Updated	2026-02-12 06:00 GMT

Overview

A text analysis technique that tokenizes chunk content for BM25 keyword search and, optionally, extracts important keywords with an LLM.

Description

Keyword Extraction and Tokenization prepares document chunks for sparse (keyword-based) retrieval. The RAGFlow tokenizer handles Chinese-English bilingual text with custom segmentation rules. Tokenized content is stored as content_ltks (space-separated tokens) for BM25 matching. When the auto_keywords config is enabled, an LLM also extracts the most important keywords, which are stored as important_kwd for boosted matching.
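
As a rough illustration of the two fields described above, the sketch below builds a chunk record with content_ltks and, optionally, important_kwd. The names simple_tokenize and build_sparse_fields are hypothetical, and the regex tokenizer is only a stand-in: it lowercases ASCII words and emits each CJK character as its own token, which is far simpler than RAGFlow's actual segmentation rules.

```python
import re

# Hypothetical stand-in for a bilingual tokenizer (NOT RAGFlow's real one):
# ASCII alphanumeric runs become lowercase word tokens, and every CJK
# character becomes a single-character token.
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def simple_tokenize(text: str) -> str:
    """Return the text as space-separated tokens, the shape used by content_ltks."""
    return " ".join(m.group(0).lower() for m in _TOKEN_RE.finditer(text))


def build_sparse_fields(content: str, keywords=None) -> dict:
    """Build the sparse-retrieval fields for one chunk.

    `keywords` stands for terms an LLM might have extracted; it is passed in
    here so the sketch makes no assumption about any LLM API.
    """
    chunk = {"content_ltks": simple_tokenize(content)}
    if keywords:
        chunk["important_kwd"] = list(keywords)
    return chunk


chunk = build_sparse_fields("RAGFlow 支持 BM25", keywords=["RAGFlow", "BM25"])
print(chunk["content_ltks"])
print(chunk["important_kwd"])
```

Keeping the extracted keywords in a separate field from the tokenized body is what lets a search backend weight them differently from ordinary content tokens.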

Usage

This step runs automatically during document processing, after embedding generation. Tokenizer behavior depends on the document store engine: with Elasticsearch the custom tokenization is applied, while with Infinity the text is returned unchanged.
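
The engine-dependent behavior can be pictured as a small dispatch. This is a minimal sketch under assumptions: tokenize_for_engine and the engine strings "elasticsearch" and "infinity" are illustrative names, not RAGFlow's actual configuration keys, and the regex is a placeholder for the real custom tokenizer.

```python
import re

# Placeholder pattern: ASCII words or single CJK characters (an assumption,
# not RAGFlow's segmentation logic).
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def tokenize_for_engine(text: str, engine: str) -> str:
    """Tokenize for a given document store engine (hypothetical helper)."""
    if engine.lower() == "infinity":
        # Pass the text through untouched, as described for Infinity.
        return text
    # For Elasticsearch (and, in this sketch, any other engine),
    # emit space-separated lowercase tokens.
    return " ".join(t.lower() for t in _TOKEN_RE.findall(text))


print(tokenize_for_engine("Hello, World", "infinity"))
print(tokenize_for_engine("Hello, World", "elasticsearch"))
```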

Theoretical Basis

Keyword search complements vector search:

BM25/TF-IDF: Statistical term-frequency models that excel at exact-match retrieval
Tokenization: Chinese text requires word segmentation; English requires stemming and stop-word handling
LLM keyword extraction: Using a chat model to identify the most important terms selects keywords by meaning rather than by frequency alone
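
To make the BM25 point concrete, here is a minimal, self-contained scorer over already-tokenized documents. It is a sketch of the standard BM25 formula (with the common non-negative IDF variant and default-style parameters k1=1.2, b=0.75 chosen as assumptions), not the scoring code of Elasticsearch, Infinity, or RAGFlow.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with BM25."""
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    # Average document length; fall back to 1 to avoid dividing by zero
    # when every document is empty.
    avgdl = sum(len(d) for d in docs_tokens) / n_docs or 1.0

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue  # term absent: contributes nothing (exact match only)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "bm25 keyword search".split(),
    "vector embedding search".split(),
    "keyword keyword extraction".split(),
]
print(bm25_scores(["keyword"], docs))
```

The second document scores zero because it never contains the query token, which is the exact-match behavior that vector search is meant to complement.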

Related Pages

Implemented By

Implementation:Infiniflow_Ragflow_Rag_Tokenizer_Tokenize
