Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Infiniflow Ragflow Rag Tokenizer Tokenize

From Leeroopedia
Revision as of 11:22, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Infiniflow_Ragflow_Rag_Tokenizer_Tokenize.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Knowledge Sources
Domains RAG, NLP
Last Updated 2026-02-12 06:00 GMT

Overview

Concrete tool for tokenizing text for BM25 keyword search provided by RAGFlow's RagTokenizer.

Description

The rag_tokenizer.tokenize function converts text into space-separated tokens suitable for BM25 search indexing. When DOC_ENGINE_INFINITY is True (Infinity backend), tokenization is a no-op (returns input unchanged) since Infinity handles its own tokenization. For Elasticsearch, the function delegates to the parent class infinity.rag_tokenizer.RagTokenizer which handles Chinese word segmentation, English stemming, and mixed-language text.

Usage

Called during document processing to prepare chunk text for keyword search indexing.

Code Reference

Source Location

  • Repository: ragflow
  • File: rag/nlp/rag_tokenizer.py
  • Lines: L20-32 (tokenize, fine_grained_tokenize), L51-57 (module exports)

Signature

class RagTokenizer(InfinityRagTokenizer):
    def tokenize(self, line: str) -> str:
        """Tokenize text for keyword search.

        If DOC_ENGINE_INFINITY is True, returns line unchanged.
        Otherwise delegates to parent class tokenizer.

        Args:
            line: str - Text to tokenize.
        Returns:
            str - Space-separated tokens.
        """

    def fine_grained_tokenize(self, tks: str) -> str:
        """Fine-grained tokenization of already-tokenized text.

        Args:
            tks: str - Space-separated tokens.
        Returns:
            str - More finely segmented tokens.
        """

# Module-level exports
tokenizer = RagTokenizer()
tokenize = tokenizer.tokenize
fine_grained_tokenize = tokenizer.fine_grained_tokenize

Import

from rag.nlp import rag_tokenizer
# or
from rag.nlp.rag_tokenizer import tokenize, fine_grained_tokenize

I/O Contract

Inputs

Name Type Required Description
line str Yes Text to tokenize

Outputs

Name Type Description
tokens str Space-separated token string

Usage Examples

from rag.nlp.rag_tokenizer import tokenize

# Tokenize for keyword search
tokens = tokenize("RAGFlow is an open-source RAG engine")
print(tokens)  # "ragflow is an open source rag engine"

Related Pages

Implements Principle

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment