Implementation:Infiniflow Ragflow Rag Tokenizer Tokenize
| Knowledge Sources | |
|---|---|
| Domains | RAG, NLP |
| Last Updated | 2026-02-12 06:00 GMT |
Overview
A concrete tool, provided by RAGFlow's RagTokenizer, that tokenizes text for BM25 keyword search.
Description
The rag_tokenizer.tokenize function converts text into a space-separated token string suitable for BM25 search indexing. When DOC_ENGINE_INFINITY is True (Infinity backend), tokenization is a no-op: the input is returned unchanged because Infinity performs its own tokenization. Otherwise (for example, with Elasticsearch), the function delegates to the parent class infinity.rag_tokenizer.RagTokenizer, which handles Chinese word segmentation, English stemming, and mixed-language text.
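The backend switch described above can be sketched as follows. This is a minimal illustration rather than RAGFlow's source: `BaseTokenizer` is a hypothetical stand-in for the parent class (it only lowercases and keeps alphanumeric runs), `SketchTokenizer` is a hypothetical name, and the `DOC_ENGINE_INFINITY` flag is modeled as a constructor argument purely for demonstration.

```python
import re


class BaseTokenizer:
    """Hypothetical stand-in for the parent tokenizer (not RAGFlow's real class)."""

    def tokenize(self, line: str) -> str:
        # Lowercase, then keep runs of ASCII letters/digits as tokens.
        return " ".join(re.findall(r"[a-z0-9]+", line.lower()))


class SketchTokenizer(BaseTokenizer):
    def __init__(self, doc_engine_infinity: bool) -> None:
        # Assumption: the flag is injected here for illustration only.
        self.doc_engine_infinity = doc_engine_infinity

    def tokenize(self, line: str) -> str:
        if self.doc_engine_infinity:
            # Infinity tokenizes on its own, so pass the text through untouched.
            return line
        # Any other backend gets the parent's segmentation.
        return super().tokenize(line)


infinity_tokens = SketchTokenizer(True).tokenize("Open-Source RAG")
es_tokens = SketchTokenizer(False).tokenize("Open-Source RAG")
```

The point of the pattern is that callers always invoke the same `tokenize` function and never need to know which document engine is configured.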
Usage
Called during document processing to prepare chunk text for keyword search indexing.
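As a rough sketch of that indexing step, the helper below attaches token fields to a chunk before it is written to the search index. The helper name `add_search_fields`, the field names `text`, `tokens`, and `fine_tokens`, and the idea of running the fine-grained pass on the coarse tokens in the same step are all assumptions made for illustration. The two demo tokenizers are simple stand-ins so the example runs without RAGFlow installed.

```python
from typing import Callable, Dict


def add_search_fields(
    chunk: Dict[str, str],
    tokenize: Callable[[str], str],
    fine_grained_tokenize: Callable[[str], str],
) -> Dict[str, str]:
    """Return a copy of chunk with token fields added (field names are hypothetical)."""
    enriched = dict(chunk)
    enriched["tokens"] = tokenize(chunk["text"])
    # The fine-grained pass takes already-tokenized text as its input.
    enriched["fine_tokens"] = fine_grained_tokenize(enriched["tokens"])
    return enriched


# Stand-in tokenizers so the sketch is self-contained.
def demo_tokenize(text: str) -> str:
    return " ".join(text.lower().split())


def demo_fine_grained(tokens: str) -> str:
    return " ".join(part for tok in tokens.split() for part in tok.split("-"))


chunk = add_search_fields(
    {"text": "Open-source  RAG engine"}, demo_tokenize, demo_fine_grained
)
```

In a real pipeline the stand-ins would presumably be replaced by `rag_tokenizer.tokenize` and `rag_tokenizer.fine_grained_tokenize`.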
Code Reference
Source Location
- Repository: ragflow
- File: rag/nlp/rag_tokenizer.py
- Lines: L20-32 (tokenize, fine_grained_tokenize), L51-57 (module exports)
Signature
class RagTokenizer(InfinityRagTokenizer):
    def tokenize(self, line: str) -> str:
        """Tokenize text for keyword search.

        If DOC_ENGINE_INFINITY is True, returns line unchanged.
        Otherwise delegates to parent class tokenizer.

        Args:
            line: str - Text to tokenize.

        Returns:
            str - Space-separated tokens.
        """

    def fine_grained_tokenize(self, tks: str) -> str:
        """Fine-grained tokenization of already-tokenized text.

        Args:
            tks: str - Space-separated tokens.

        Returns:
            str - More finely segmented tokens.
        """


# Module-level exports
tokenizer = RagTokenizer()
tokenize = tokenizer.tokenize
fine_grained_tokenize = tokenizer.fine_grained_tokenize
Import
from rag.nlp import rag_tokenizer
# or
from rag.nlp.rag_tokenizer import tokenize, fine_grained_tokenize
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| line | str | Yes | Text to tokenize |
Outputs
| Name | Type | Description |
|---|---|---|
| tokens | str | Space-separated token string |
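Because the output is a single space-separated string, downstream code that needs individual terms has to split it. The snippet below is a small sketch of that, using a hand-written example string rather than real tokenizer output, and counting term frequencies (one of the inputs to BM25 scoring).

```python
from collections import Counter

# Example value written by hand; not actual output of rag_tokenizer.tokenize.
token_string = "ragflow is an open source rag engine rag"

# Split on whitespace to recover individual terms.
terms = token_string.split()

# Count how often each term occurs in this text.
term_counts = Counter(terms)
```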
Usage Examples
from rag.nlp.rag_tokenizer import tokenize
# Tokenize for keyword search
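tokens = tokenize("RAGFlow is an open-source RAG engine")
# With the Infinity backend the input is returned unchanged. Otherwise the result is a
# space-separated token string whose exact form depends on the parent tokenizer (e.g. stemming).
print(tokens)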