Implementation:Infiniflow Ragflow Rag Tokenizer Tokenize

Knowledge Sources	RAGFlow
Domains	RAG, NLP
Last Updated	2026-02-12 06:00 GMT

Overview

A concrete tool, provided by RAGFlow's RagTokenizer, that tokenizes text for BM25 keyword search.

Description

The rag_tokenizer.tokenize function converts text into a string of space-separated tokens suitable for BM25 search indexing. When DOC_ENGINE_INFINITY is True (Infinity backend), tokenize is a no-op that returns its input unchanged, because Infinity performs its own tokenization. Otherwise (for example, with Elasticsearch), the function delegates to the parent class infinity.rag_tokenizer.RagTokenizer, which handles Chinese word segmentation, English stemming, and mixed-language text.
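The backend-dependent dispatch described above can be illustrated with a minimal sketch. This is not RAGFlow's code: BaseTokenizer and SketchTokenizer are hypothetical stand-ins, the base tokenizer only lowercases and splits on non-alphanumeric characters (it does no Chinese segmentation or stemming), and the flag is passed as a constructor argument purely to keep the example self-contained.

```python
import re

# Assumption: stand-in for the DOC_ENGINE_INFINITY setting described above.
DOC_ENGINE_INFINITY = False


class BaseTokenizer:
    """Hypothetical stand-in for the parent tokenizer.

    The real parent class also handles Chinese word segmentation and
    English stemming; this one only lowercases and splits.
    """

    def tokenize(self, line: str) -> str:
        # Keep runs of ASCII letters/digits and join them with single spaces.
        return " ".join(re.findall(r"[a-z0-9]+", line.lower()))


class SketchTokenizer(BaseTokenizer):
    """Hypothetical subclass showing the no-op versus delegate pattern."""

    def __init__(self, infinity: bool = DOC_ENGINE_INFINITY):
        self.infinity = infinity

    def tokenize(self, line: str) -> str:
        if self.infinity:
            # The search backend tokenizes on its own, so pass text through.
            return line
        # Otherwise delegate to the parent implementation.
        return super().tokenize(line)
```

With infinity=True the input comes back untouched; with infinity=False the stand-in parent produces a lowercased, space-separated string.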

Usage

Called during document processing to prepare chunk text for keyword search indexing.
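As a rough sketch of what that preparation step might look like, the helper below attaches a coarse and a fine-grained token string to a chunk before indexing. The function prepare_chunk and the field names content_ltks and content_sm_ltks are assumptions made for illustration, and the tokenizers are passed in as plain callables so the example runs without RAGFlow installed.

```python
from typing import Callable, Dict


def prepare_chunk(
    text: str,
    tokenize: Callable[[str], str],
    fine_grained_tokenize: Callable[[str], str],
) -> Dict[str, str]:
    """Build an index-ready record for one chunk (hypothetical helper)."""
    # Coarse tokens come from the raw text.
    coarse = tokenize(text)
    return {
        "text": text,
        # Assumed field names; adjust to the schema of the search index.
        "content_ltks": coarse,
        # Fine-grained tokens are derived from the already-tokenized string.
        "content_sm_ltks": fine_grained_tokenize(coarse),
    }
```

In a real pipeline the two callables would be the tokenize and fine_grained_tokenize functions exported by rag.nlp.rag_tokenizer.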

Code Reference

Source Location

Repository: ragflow
File: rag/nlp/rag_tokenizer.py
Lines: L20-32 (tokenize, fine_grained_tokenize), L51-57 (module exports)

Signature

class RagTokenizer(InfinityRagTokenizer):
    def tokenize(self, line: str) -> str:
        """Tokenize text for keyword search.

        If DOC_ENGINE_INFINITY is True, returns line unchanged.
        Otherwise delegates to parent class tokenizer.

        Args:
            line: str - Text to tokenize.
        Returns:
            str - Space-separated tokens.
        """

    def fine_grained_tokenize(self, tks: str) -> str:
        """Fine-grained tokenization of already-tokenized text.

        Args:
            tks: str - Space-separated tokens.
        Returns:
            str - More finely segmented tokens.
        """

# Module-level exports
tokenizer = RagTokenizer()
tokenize = tokenizer.tokenize
fine_grained_tokenize = tokenizer.fine_grained_tokenize

Import

from rag.nlp import rag_tokenizer
# or
from rag.nlp.rag_tokenizer import tokenize, fine_grained_tokenize

I/O Contract

Inputs

Name	Type	Required	Description
line	str	Yes	Text to tokenize

Outputs

Name	Type	Description
tokens	str	Space-separated token string

Usage Examples

from rag.nlp.rag_tokenizer import tokenize

# Tokenize for keyword search
tokens = tokenize("RAGFlow is an open-source RAG engine")
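# The result is a space-separated token string. The exact tokens depend on the
# backend: with DOC_ENGINE_INFINITY set to True the input is returned unchanged;
# otherwise the parent tokenizer normalizes and stems the words.
print(tokens)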

Related Pages

Implements Principle

Principle:Infiniflow_Ragflow_Keyword_Extraction
