Principle:Infiniflow Ragflow Keyword Extraction

From Leeroopedia
Knowledge Sources
Domains RAG, NLP, Information_Retrieval
Last Updated 2026-02-12 06:00 GMT

Overview

A text analysis technique that tokenizes chunk content for BM25 keyword search and, optionally, extracts important keywords with an LLM.

Description

Keyword extraction and tokenization prepare document chunks for sparse (keyword-based) retrieval. The RAGFlow tokenizer handles bilingual Chinese-English text with custom segmentation rules. Tokenized content is stored in the content_ltks field as space-separated tokens for BM25 matching. Optionally, when the auto_keywords setting is configured, an LLM extracts the most important keywords, which are stored in the important_kwd field for boosted matching.
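The flow described above can be sketched in Python as follows. This is a minimal illustration, not RAGFlow's implementation: the regex tokenizer (lowercased ASCII word runs plus single CJK characters) is a simplified stand-in for real word segmentation, and tokenize, extract_keywords, build_chunk_fields, the chat callable, and top_n are hypothetical names. Only the field names content_ltks and important_kwd come from the description.

```python
import re
from typing import Callable, Dict, List, Optional

# Simplified stand-in for a bilingual tokenizer (assumption): ASCII word runs
# become one token each, and every CJK character becomes its own token.
_TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[\u4e00-\u9fff]")


def tokenize(text: str) -> str:
    """Return lowercased, space-separated tokens."""
    return " ".join(tok.lower() for tok in _TOKEN_RE.findall(text))


def extract_keywords(content: str, chat: Callable[[str], str], top_n: int = 3) -> List[str]:
    """Ask a chat model for keywords; `chat` is a hypothetical prompt -> reply callable."""
    prompt = (
        f"List the {top_n} most important keywords in the text below, "
        f"separated by commas.\n\n{content}"
    )
    reply = chat(prompt)
    keywords = [kw.strip() for kw in reply.split(",") if kw.strip()]
    return keywords[:top_n]


def build_chunk_fields(
    content: str,
    chat: Optional[Callable[[str], str]] = None,
    top_n: int = 3,
) -> Dict[str, object]:
    """Build the sparse-retrieval fields for one chunk."""
    fields: Dict[str, object] = {"content_ltks": tokenize(content)}
    if chat is not None:  # keyword extraction is optional
        fields["important_kwd"] = extract_keywords(content, chat, top_n)
    return fields
```

Passing chat=None skips the LLM step and yields only content_ltks, which mirrors the optional nature of keyword extraction.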

Usage

This step runs automatically during document processing, after embedding generation. Tokenizer behavior depends on the document store engine: with Elasticsearch, custom tokenization is applied; with Infinity, the text is returned unchanged.
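The engine-dependent behavior might look roughly like the sketch below. The function name tokenize_for_engine and the engine strings "elasticsearch" and "infinity" are assumptions made for illustration, and the regex split stands in for the custom tokenization, whose actual rules are not shown here.

```python
import re


def tokenize_for_engine(text: str, engine: str) -> str:
    """Tokenize `text` according to the document store engine (hypothetical helper)."""
    if engine == "infinity":
        # Pass-through: the text is returned unchanged.
        return text
    if engine == "elasticsearch":
        # Placeholder for custom tokenization: lowercase, then split on
        # non-word characters and join with single spaces.
        return " ".join(re.findall(r"\w+", text.lower()))
    raise ValueError(f"unsupported engine: {engine!r}")
```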

Theoretical Basis

Keyword search complements vector search:

  • BM25/TF-IDF: Statistical term-frequency models that excel at exact-match retrieval
  • Tokenization: Chinese text requires word segmentation because words are not separated by spaces; English text typically relies on stemming and stop-word handling
  • LLM keyword extraction: A chat model identifies the most important terms, selecting keywords by meaning rather than by frequency alone
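To make the BM25 bullet concrete, here is a minimal scorer over pre-tokenized documents. It is a sketch of the standard BM25 formula with the commonly used defaults k1 = 1.2 and b = 0.75 and a smoothed, always-positive IDF; it is not the scoring code of any particular search engine, and bm25_scores is a hypothetical name.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query_tokens: List[str],
    docs_tokens: List[List[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> List[float]:
    """Score each tokenized document against the query with BM25."""
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    # Average document length; fall back to 1.0 if every document is empty.
    avgdl = sum(len(d) for d in docs_tokens) / n_docs or 1.0

    # Document frequency: number of documents containing each term.
    df: Counter = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores
```

A document scores above zero only if it contains at least one query token, which is why the quality of tokenization directly bounds what keyword search can match.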

Related Pages

Implemented By
