Workflow:Infiniflow Ragflow Document Processing Pipeline

From Leeroopedia
Knowledge Sources
Domains: RAG, Document_Processing, NLP, OCR
Last Updated: 2026-02-12 06:00 GMT

Overview

End-to-end backend pipeline that transforms raw documents into searchable, embedded chunks stored in the document store, encompassing PDF parsing, layout analysis, OCR, text chunking, embedding generation, and indexing.

Description

This workflow describes the internal document processing pipeline that executes when a user triggers document processing in RAGFlow. It covers the complete transformation from raw uploaded files to indexed, searchable chunks. The pipeline is executed by background task executor workers that consume tasks from a Redis-based queue. Each document goes through format-specific parsing (PDF layout analysis, OCR, Office conversion), template-based chunking according to the configured parser method, embedding generation using the knowledge base's embedding model, and finally indexing into the document store (Elasticsearch, Infinity, or OpenSearch). The pipeline also supports advanced features like RAPTOR hierarchical summarization, knowledge graph extraction, and PageRank scoring.

Usage

This workflow executes automatically when documents are queued for processing in a knowledge base. Understanding this pipeline is essential for developers who need to debug document processing issues, optimize chunk quality, add new parser types, or extend the embedding and indexing infrastructure.

Execution Steps

Step 1: Task Queue Consumption

Background task executor workers poll the Redis-based task queue for pending document processing tasks. Each worker is identified by a host ID and consumer number. Tasks are distributed across workers using Redis consumer groups, so each queued task is delivered to only one worker at a time; a task that fails can still be retried. The worker retrieves the task details, including document metadata, knowledge base configuration, parser settings, and embedding model information.

Key considerations:

  • Workers use Redis consumer groups (SVR_CONSUMER_GROUP_NAME) for reliable task distribution
  • Each worker runs as a separate process with jemalloc memory optimization
  • Failed tasks are retried up to 3 times before being marked as abandoned
  • Task progress is tracked in both Redis (real-time) and MySQL (persistent)
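
The retry rule above can be sketched with an in-memory queue. This is a minimal illustration, not RAGFlow's executor: the Redis consumer group is replaced by a `collections.deque`, the names `run_worker` and `handler` are hypothetical, and "retried up to 3 times" is read here as one initial attempt plus three retries.

```python
from collections import deque

MAX_RETRIES = 3  # from the text: retried up to 3 times before abandonment


def run_worker(task_ids, handler, max_retries=MAX_RETRIES):
    """Drain an in-memory queue, re-queuing failed tasks.

    Hypothetical stand-in for a queue consumer. Returns two lists of
    task ids: (done, abandoned).
    """
    # Each entry is (task id, number of retries already used).
    queue = deque((task_id, 0) for task_id in task_ids)
    done, abandoned = [], []
    while queue:
        task_id, retries = queue.popleft()
        try:
            handler(task_id)
            done.append(task_id)
        except Exception:
            if retries < max_retries:
                # Put the task back at the end of the queue for another try.
                queue.append((task_id, retries + 1))
            else:
                abandoned.append(task_id)
    return done, abandoned
```

A real deployment would also persist the retry count, since an in-process counter is lost when a worker crashes.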

Step 2: Document Parsing

The raw document is retrieved from object storage (MinIO/S3) and parsed according to its file type. PDF documents go through the DeepDoc pipeline, which includes layout analysis (detecting text blocks, tables, figures, and headers) using YOLO-based models, and OCR for scanned content using PaddleOCR or MinerU. Office documents (DOCX, XLSX, PPTX) are converted and parsed using specialized parsers. Image files are processed with OCR and optional multi-modal LLM analysis. Audio files are transcribed using ASR models.

What happens:

  • File is downloaded from object storage to a temporary location
  • Format-specific parser is invoked based on file extension and parser_id configuration
  • PDF parsing includes layout recognition, table extraction, and figure detection
  • OCR is applied to scanned pages and embedded images
  • Output is structured text with positional metadata
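
Format-specific dispatch of this kind can be sketched as a lookup on the file extension. The routing table below is an assumption for illustration: the four categories follow the prose above, but the extension lists, the `ROUTES` name, and `route_for` are hypothetical, and the interaction with `parser_id` is not modeled.

```python
from pathlib import Path

# Hypothetical routing table. The categories mirror the step description;
# the extension lists are illustrative, not RAGFlow's configuration.
ROUTES = {
    "pdf": {".pdf"},
    "office": {".docx", ".xlsx", ".pptx"},
    "image": {".png", ".jpg", ".jpeg"},
    "audio": {".mp3", ".wav"},
}


def route_for(filename: str) -> str:
    """Return the parser category for a file name, ignoring case."""
    suffix = Path(filename).suffix.lower()
    for route, extensions in ROUTES.items():
        if suffix in extensions:
            return route
    raise ValueError(f"no parser registered for {suffix!r}")
```

Raising on an unknown extension, rather than falling back to a default parser, makes unsupported uploads visible early; whether the real pipeline does this is not stated in the source.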

Step 3: Text Chunking

The parsed text is split into chunks according to the configured chunking method (parser). Each method implements a different splitting strategy optimized for specific document types. The naive parser uses delimiter-based splitting with configurable token limits. The book parser respects chapter structure. The paper parser follows academic paper sections. Each chunk is assigned metadata including position, page number, and structural context.

Key considerations:

  • Chunk size is controlled by max_token_num configuration
  • Delimiters can be customized per knowledge base or document
  • The children delimiter setting allows hierarchical sub-chunking within main chunks
  • Token counting uses the tiktoken library for accurate estimates
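
Delimiter-based splitting under a token limit can be sketched as follows. This is a simplified illustration rather than the naive parser itself: `naive_chunk` and `count_tokens` are hypothetical names, the default delimiters are assumed, and token counting uses whitespace-separated words as a stand-in for tiktoken.

```python
import re


def count_tokens(text: str) -> int:
    # Stand-in counter: whitespace-separated words. The pipeline described
    # above uses tiktoken; this keeps the sketch dependency-free.
    return len(text.split())


def naive_chunk(text, delimiters="\n.!?", max_token_num=128):
    """Split text on delimiter characters, then pack pieces into chunks.

    Pieces are merged greedily until adding the next one would exceed
    max_token_num. Delimiters are dropped from the output, and a single
    piece longer than the limit becomes its own oversized chunk.
    """
    pattern = "[" + re.escape(delimiters) + "]+"
    pieces = [p.strip() for p in re.split(pattern, text) if p.strip()]
    chunks, current, current_tokens = [], [], 0
    for piece in pieces:
        n = count_tokens(piece)
        if current and current_tokens + n > max_token_num:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(piece)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For example, `naive_chunk("one two three. four five. six seven eight nine. ten", max_token_num=5)` packs the four sentences into two chunks of five words each.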

Step 4: Embedding Generation

Each text chunk is converted to a dense vector representation using the embedding model configured for the knowledge base. RAGFlow supports 66+ LLM providers through a factory pattern, each with specific embedding model options. Embeddings are generated in batches for efficiency. For optimal retrieval quality, the embedding model should suit the language and domain of the documents.

Key considerations:

  • Embedding model is set at the knowledge base level and shared across all documents
  • Batch processing improves throughput for large document sets
  • Different embedding models produce vectors of different dimensions
  • Both the chunk text and its embedding are stored together
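
Batching and storing text alongside its vector can be sketched as below. The embedding model is replaced by a toy function, and all names (`batched`, `embed_chunks`, `toy_embed`) and the batch size are assumptions made for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_chunks(chunks, embed_batch, batch_size=16):
    """Embed chunk texts in batches and pair each text with its vector.

    embed_batch is any callable mapping a list of strings to a list of
    equal-length vectors; it stands in for a real embedding model.
    """
    records = []
    for batch in batched(chunks, batch_size):
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("model returned a different number of vectors")
        for text, vector in zip(batch, vectors):
            records.append({"text": text, "vector": vector})
    # All vectors from one model should share a single dimension.
    if len({len(r["vector"]) for r in records}) > 1:
        raise ValueError("inconsistent vector dimensions")
    return records


def toy_embed(batch):
    # Fake 3-dimensional "embedding" so the sketch runs without a model.
    return [[float(len(t)), float(t.count(" ")), 1.0] for t in batch]
```

Checking the dimension once per document catches a misconfigured model before any vectors reach the document store.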

Step 5: Keyword Extraction and Tokenization

Alongside embedding generation, each chunk undergoes keyword extraction and tokenization for hybrid search support. The RAGFlow tokenizer produces fine-grained tokens used for keyword-based (BM25) search. Optional features include auto-generated keywords, auto-generated questions, and cross-language query tokens. These tokens enable the hybrid search strategy that combines semantic similarity with keyword matching.

Key considerations:

  • RAGFlow uses a custom tokenizer (rag_tokenizer) optimized for both English and Chinese text
  • Auto-keywords and auto-questions use an LLM to enrich chunk metadata
  • Cross-language tokens enable querying in one language and retrieving in another
  • PageRank scoring can be applied to weight chunks by importance
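
To show why both tokens and vectors are stored, here is a rough sketch of how a hybrid score might combine the two signals at query time. It is not RAGFlow's ranking formula: BM25 is swapped for a plain token-overlap ratio, and the function names and the 0.7 vector weight are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query_tokens, chunk_tokens):
    """Fraction of distinct query tokens found in the chunk.

    A deliberately simple stand-in for BM25 keyword scoring.
    """
    query = set(query_tokens)
    if not query:
        return 0.0
    return len(query & set(chunk_tokens)) / len(query)


def hybrid_score(query_vec, chunk_vec, query_tokens, chunk_tokens,
                 vector_weight=0.7):
    """Weighted sum of semantic similarity and keyword overlap."""
    semantic = cosine(query_vec, chunk_vec)
    keyword = keyword_overlap(query_tokens, chunk_tokens)
    return vector_weight * semantic + (1 - vector_weight) * keyword
```

With identical vectors and one of two query tokens matched, the score is 0.7 × 1.0 + 0.3 × 0.5 = 0.85.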

Step 6: Document Store Indexing

The processed chunks with their embeddings, tokens, and metadata are bulk-inserted into the document store. RAGFlow supports three backends: Elasticsearch (default), Infinity, and OpenSearch. The index mapping defines fields for content text, dense vectors, keyword tokens, metadata fields, and structural information. Chunks are indexed with the knowledge base ID and document ID for efficient filtering during retrieval.

Key considerations:

  • Index mappings are defined in configuration files (mapping.json, infinity_mapping.json, os_mapping.json)
  • Bulk insertion is used for efficient indexing of multiple chunks
  • The document store must be properly configured with the correct vector dimensions
  • Existing chunks for a document are removed before re-indexing (idempotent processing)
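
The delete-then-insert pattern that makes re-indexing idempotent can be sketched against an in-memory store. `InMemoryStore` and `reindex` are hypothetical names; a real backend would use its own delete-by-query and bulk APIs, which are not shown here.

```python
class InMemoryStore:
    """Toy document store that enforces a fixed vector dimension."""

    def __init__(self, vector_dim):
        self.vector_dim = vector_dim
        self.rows = []

    def delete_document(self, kb_id, doc_id):
        """Remove all rows of one document; return how many were removed."""
        before = len(self.rows)
        self.rows = [
            r for r in self.rows
            if not (r["kb_id"] == kb_id and r["doc_id"] == doc_id)
        ]
        return before - len(self.rows)

    def bulk_insert(self, rows):
        """Validate every row first, then add them all at once."""
        for row in rows:
            if len(row["vector"]) != self.vector_dim:
                raise ValueError("vector dimension does not match the index")
        self.rows.extend(rows)


def reindex(store, kb_id, doc_id, chunks):
    """Replace a document's chunks so repeated runs leave the same state."""
    rows = [
        {"kb_id": kb_id, "doc_id": doc_id,
         "text": c["text"], "vector": c["vector"]}
        for c in chunks
    ]
    store.delete_document(kb_id, doc_id)
    store.bulk_insert(rows)
    return len(rows)
```

Running `reindex` twice with the same chunks leaves exactly one copy of each chunk, and rows belonging to other documents are untouched.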

Step 7: Progress Finalization

After all chunks are indexed, the task executor updates the document status with final progress, chunk count, token count, and processing duration. The document status transitions from "running" to "done" (or "fail" if errors occurred). A background progress updater thread in the main server periodically aggregates task-level progress into document-level progress for UI display.

What happens:

  • Task progress is set to 1.0 (100%) on success or -1 on failure
  • Document record is updated with chunk_num, token_num, and process_duration
  • Processing logs are stored in Redis with a 30-minute TTL
  • UI receives real-time progress updates through polling
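
Rolling task-level progress up to a document-level value might look like the sketch below. The progress values (1.0 for success, -1 for failure) and the status strings come from the description above; the aggregation rule itself (any failure fails the document, otherwise take the mean) and the function names are assumptions.

```python
def document_progress(task_progress):
    """Combine per-task progress values into one document-level value.

    Assumed rule: any failed task (-1) fails the document; otherwise the
    document progress is the mean of its task progress values.
    """
    if not task_progress:
        return 0.0
    if any(p < 0 for p in task_progress):
        return -1.0
    return sum(task_progress) / len(task_progress)


def status_for(progress):
    """Map a progress value to the status strings used in this step."""
    if progress < 0:
        return "fail"
    if progress >= 1.0:
        return "done"
    return "running"
```

For instance, a document with one finished task and one task at 50% reports 0.75 and stays in "running" until both tasks reach 1.0.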
