Workflow:Infiniflow Ragflow Knowledge Base Document Ingestion

From Leeroopedia
Knowledge Sources
Domains RAG, Knowledge_Management, Document_Processing
Last Updated 2026-02-12 06:00 GMT

Overview

End-to-end process for creating a knowledge base in RAGFlow, uploading documents, configuring parsing methods, and triggering document ingestion to build a searchable knowledge store.

Description

This workflow covers the complete process of setting up a knowledge base (dataset) in RAGFlow. A knowledge base is the fundamental unit of organization for documents and their processed chunks. The process involves creating the knowledge base with an embedding model and language selection; uploading documents in various formats (PDF, Word, Excel, images, etc.); selecting an appropriate chunking method (naive, book, paper, QA, knowledge graph, etc.); configuring parser-specific options (layout recognition, OCR, delimiter settings); and triggering document processing, which parses, chunks, embeds, and indexes the content into the document store (Elasticsearch or Infinity). Once ingested, the knowledge base is available for retrieval in chat applications, search applications, and agent workflows.

Usage

Execute this workflow when you need to build a searchable knowledge repository from a collection of documents. This is the foundational step before creating any RAG-powered application in RAGFlow. Use it when you have documents (PDFs, Word files, spreadsheets, images, or other formats) that you want to make available for AI-powered question answering, search, or agent-based retrieval.

Execution Steps

Step 1: Create Knowledge Base

Create a new knowledge base (dataset) by providing a name, selecting an embedding model, and choosing the primary language. The embedding model determines how text chunks are vectorized for semantic search. The language setting influences tokenization and text processing behavior. Optionally, configure permissions (team-level or personal) and add a description.

Key considerations:

  • Choose an embedding model that matches your document language and domain
  • The language setting affects tokenization quality for CJK vs. Latin text
  • Changing the embedding model after documents have been ingested generally requires re-processing all of them
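
The settings described in this step can be pictured as a small validated record. The sketch below is illustrative only: `KnowledgeBaseConfig`, its field names, the permission values "me" and "team", and the model name are hypothetical and are not taken from RAGFlow's API.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class KnowledgeBaseConfig:
    """Hypothetical container for the Step 1 settings (not a RAGFlow class)."""
    name: str
    embedding_model: str
    language: str = "English"
    permission: str = "me"   # assumed values: "me" (personal) or "team"
    description: str = ""

    def __post_init__(self):
        # Reject obviously incomplete settings before anything is sent anywhere.
        if not self.name.strip():
            raise ValueError("knowledge base name must not be empty")
        if not self.embedding_model.strip():
            raise ValueError("an embedding model must be selected")
        if self.permission not in ("me", "team"):
            raise ValueError("permission must be 'me' or 'team'")

    def to_payload(self) -> dict:
        """Return the settings as a plain dict, e.g. for a JSON request body."""
        return asdict(self)


# Placeholder values; substitute the embedding model configured in your deployment.
config = KnowledgeBaseConfig(name="product-manuals",
                             embedding_model="example-embedding-model")
payload = config.to_payload()
```

Validating the settings up front is useful here because the embedding model is costly to change once documents have been processed.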

Step 2: Upload Documents

Upload one or more documents to the knowledge base. RAGFlow supports a wide range of file formats including PDF, DOCX, XLSX, PPTX, TXT, CSV, images (PNG, JPG), markdown, HTML, and more. Documents are stored in the object storage backend (MinIO/S3) and registered in the database with metadata tracking.

Key considerations:

  • Supported formats include PDF, Word, Excel, PowerPoint, images, plain text, markdown, HTML, and more
  • File size limits and per-user file count limits may apply depending on configuration
  • Documents can also be linked from the File Manager or synced from external data sources (S3, Confluence, Notion, Google Drive)
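
Before uploading a batch, it can help to separate files whose format is supported from those that are not. The following is a minimal sketch under stated assumptions: the extension set covers only the formats named in this step (the actual list is longer), and `partition_uploads` is a hypothetical helper, not part of RAGFlow.

```python
from pathlib import Path

# Illustrative subset based on the formats named in this step; not exhaustive.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx", ".pptx", ".txt",
    ".csv", ".png", ".jpg", ".md", ".html",
}


def partition_uploads(filenames):
    """Split filenames into (accepted, rejected) by file extension."""
    accepted, rejected = [], []
    for name in filenames:
        suffix = Path(name).suffix.lower()  # case-insensitive comparison
        if suffix in SUPPORTED_EXTENSIONS:
            accepted.append(name)
        else:
            rejected.append(name)
    return accepted, rejected


accepted, rejected = partition_uploads(["manual.PDF", "notes.md", "archive.tar.gz"])
```

Only the final extension is inspected, so `archive.tar.gz` is treated as a `.gz` file and rejected.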

Step 3: Configure Chunking Method

Select the appropriate chunking method (parser) for each document or set a default for the knowledge base. RAGFlow provides multiple template-based chunking strategies optimized for different document types.

Available methods:

  • Naive (General): Configurable delimiter-based splitting with layout recognition
  • Book: Optimized for long-form book content with TOC enhancement
  • Paper: Specialized for academic papers with section-aware parsing
  • Laws: Legal document parsing with clause structure preservation
  • Presentation: Slide deck parsing
  • QA: Extracts question-answer pairs directly
  • Table: Structured data extraction
  • Knowledge Graph: Entity and relationship extraction for graph-based retrieval
  • Tag: Tag-based categorization chunking
  • One: Entire document as a single chunk

Key considerations:

  • Each method has specific configuration options (max token count, layout recognition, OCR settings)
  • The naive method is the most flexible and suitable for general documents
  • The knowledge graph method requires an LLM for entity extraction
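
One way to choose a default method per document is a simple lookup that falls back to the naive method. This is a heuristic sketch, not RAGFlow behavior: the string identifiers, the extension mapping, and `suggest_chunking_method` are all assumptions made for illustration.

```python
from pathlib import Path

# Labels mirror the methods listed above; the identifiers themselves are illustrative.
CHUNKING_METHODS = {
    "naive", "book", "paper", "laws", "presentation",
    "qa", "table", "knowledge_graph", "tag", "one",
}

# Hypothetical defaults for file types with an obvious structure.
DEFAULT_BY_EXTENSION = {
    ".pptx": "presentation",
    ".xlsx": "table",
    ".csv": "table",
}


def suggest_chunking_method(filename, document_kind=None):
    """Return an explicit choice if given, else a default based on the extension."""
    if document_kind is not None:
        if document_kind not in CHUNKING_METHODS:
            raise ValueError(f"unknown chunking method: {document_kind}")
        return document_kind
    # Fall back to the naive method, which suits general documents.
    return DEFAULT_BY_EXTENSION.get(Path(filename).suffix.lower(), "naive")
```

For example, `suggest_chunking_method("thesis.pdf", "paper")` returns the explicit choice, while `suggest_chunking_method("report.pdf")` falls back to the naive method.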

Step 4: Configure Parser Options

Fine-tune the parser configuration for the selected chunking method. Common options include maximum token count per chunk, layout recognition model selection, OCR settings (PaddleOCR or MinerU), delimiter customization, RAPTOR hierarchical summarization, auto-keyword generation, and cross-language query support.

Key considerations:

  • Max token count controls chunk granularity (larger chunks provide more context, smaller chunks improve precision)
  • Layout recognition helps extract structured content from PDFs and images
  • RAPTOR creates hierarchical summaries for improved retrieval on broad queries
  • Auto-keywords and auto-questions can enrich chunks for better search recall
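
The interaction between delimiters and the maximum token count can be illustrated with a simplified chunker. This sketch is not RAGFlow's parser: it approximates tokens as whitespace-separated words, ignores layout recognition, and `chunk_by_delimiter` and `count_tokens` are hypothetical names.

```python
import re


def count_tokens(text):
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def chunk_by_delimiter(text, delimiters="\n", max_tokens=128):
    """Split text on any delimiter character, then merge pieces up to max_tokens."""
    if not delimiters:
        raise ValueError("at least one delimiter character is required")
    pattern = "[" + re.escape(delimiters) + "]+"
    pieces = [p.strip() for p in re.split(pattern, text) if p.strip()]

    chunks, current, current_tokens = [], [], 0
    for piece in pieces:
        n = count_tokens(piece)
        # Start a new chunk when adding this piece would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(piece)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "alpha beta\ngamma delta\nepsilon"
coarse = chunk_by_delimiter(sample, max_tokens=4)
fine = chunk_by_delimiter(sample, max_tokens=2)
```

With the larger budget the first two lines merge into one chunk; with the smaller budget each line becomes its own chunk. A single piece longer than `max_tokens` is kept whole in this sketch rather than split further.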

Step 5: Trigger Document Processing

Start the document processing pipeline by triggering the "run" action on uploaded documents. This queues tasks for the background task executor workers, which parse, chunk, embed, and index the documents. Progress can be monitored in the UI through real-time status updates and processing logs.

Key considerations:

  • Processing is asynchronous and handled by background task executor workers
  • Multiple documents can be processed concurrently depending on worker configuration
  • Failed tasks are retried up to 3 times before being marked as abandoned
  • Processing progress and logs are tracked in Redis and displayed in the UI
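
The retry behavior described above can be sketched as a small loop. This is an illustration, not the task executor's code: it assumes "retried up to 3 times" means one initial attempt plus up to three retries, and `run_with_retries` is a hypothetical helper.

```python
MAX_RETRIES = 3  # assumption: retries allowed after the first attempt


def run_with_retries(task, max_retries=MAX_RETRIES):
    """Call task() until it succeeds or the retry budget is exhausted.

    Returns a (status, attempts) tuple where status is "done" or "abandoned".
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        try:
            task()
            return "done", attempts
        except Exception:
            # A real worker would log the failure before trying again.
            continue
    return "abandoned", attempts
```

A task that fails on every call is attempted four times under this assumption (one initial attempt and three retries) before it is marked as abandoned.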

Step 6: Verify and Test Retrieval

After processing completes, verify the results using the built-in retrieval testing feature. Enter test queries to check that relevant chunks are being retrieved with appropriate similarity scores. Review chunk content and adjust parser settings if needed. The testing interface allows configuring similarity threshold, Top-N results, reranking model, and keyword vs. semantic search weights.

Key considerations:

  • Use the retrieval testing tab within the knowledge base to validate chunk quality
  • Adjust similarity threshold and Top-N parameters to optimize recall and precision
  • Consider enabling a reranking model for improved result ordering
  • If results are poor, revisit chunking method and parser configuration
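
To see how the similarity threshold, Top-N, and the keyword vs. semantic weighting interact, consider the sketch below. It assumes the final score is a weighted sum of a keyword similarity and a vector similarity, both in [0, 1]; the function name, parameter names, and default values are hypothetical and the reranking step is omitted.

```python
def rank_chunks(candidates, vector_weight=0.3, similarity_threshold=0.2, top_n=8):
    """Score, filter, and order retrieval candidates.

    candidates: iterable of (chunk_id, keyword_sim, vector_sim) tuples.
    Returns up to top_n (chunk_id, score) pairs, best first.
    """
    scored = []
    for chunk_id, keyword_sim, vector_sim in candidates:
        # Weighted blend of semantic and keyword similarity (assumed formula).
        score = vector_weight * vector_sim + (1.0 - vector_weight) * keyword_sim
        if score >= similarity_threshold:
            scored.append((chunk_id, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


results = rank_chunks(
    [("a", 0.9, 0.5), ("b", 0.1, 0.2), ("c", 0.4, 0.8)],
    vector_weight=0.5,
    similarity_threshold=0.3,
    top_n=2,
)
```

In this example chunk "b" falls below the threshold and is dropped, while "a" and "c" are returned in descending score order. Raising the threshold trades recall for precision; raising Top-N does the opposite.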
