Workflow:Infiniflow Ragflow Knowledge Base Document Ingestion
| Knowledge Sources | |
|---|---|
| Domains | RAG, Knowledge_Management, Document_Processing |
| Last Updated | 2026-02-12 06:00 GMT |
Overview
End-to-end process for creating a knowledge base in RAGFlow, uploading documents, configuring parsing methods, and triggering document ingestion to build a searchable knowledge store.
Description
This workflow covers the complete process of setting up a knowledge base (dataset) in RAGFlow. A knowledge base is the fundamental unit of organization for documents and their processed chunks. The process has five parts: creating the knowledge base with an embedding model and language selection; uploading documents in various formats (PDF, Word, Excel, images, etc.); selecting a chunking method (naive, book, paper, QA, knowledge graph, etc.); configuring parser-specific options (layout recognition, OCR, delimiter settings); and triggering document processing, which parses, chunks, embeds, and indexes the content into the document store (Elasticsearch or Infinity). Once ingested, the knowledge base is available for retrieval in chat applications, search applications, and agent workflows.
Usage
Execute this workflow when you need to build a searchable knowledge repository from a collection of documents. This is the foundational step before creating any RAG-powered application in RAGFlow. Use it when you have documents (PDFs, Word files, spreadsheets, images, or other formats) that you want to make available for AI-powered question answering, search, or agent-based retrieval.
Execution Steps
Step 1: Create Knowledge Base
Create a new knowledge base (dataset) by providing a name, selecting an embedding model, and choosing the primary language. The embedding model determines how text chunks are vectorized for semantic search. The language setting influences tokenization and text processing behavior. Optionally, configure permissions (team-level or personal) and add a description.
Key considerations:
- Choose an embedding model that matches your document language and domain
- The language setting affects tokenization quality for CJK vs. Latin text
- Changing the embedding model after documents have been ingested requires re-processing all of them, so choose it carefully up front
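The settings collected in this step can be sketched as a small validated record. This is only an illustration, not RAGFlow's API: the class name, field names, and the permission labels "me" and "team" are assumptions made for the example. The helper shows why an embedding model change forces re-processing: vectors produced by different models are not comparable.

```python
from dataclasses import dataclass

# Hypothetical names throughout: this is not RAGFlow's API, only a sketch
# of the settings Step 1 asks for.
ALLOWED_PERMISSIONS = {"me", "team"}  # personal vs. team-level (assumed labels)


@dataclass(frozen=True)
class KnowledgeBaseConfig:
    name: str
    embedding_model: str
    language: str = "English"
    permission: str = "me"
    description: str = ""

    def __post_init__(self):
        # Reject the obviously invalid cases before anything is created.
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if not self.embedding_model.strip():
            raise ValueError("an embedding model must be selected")
        if self.permission not in ALLOWED_PERMISSIONS:
            raise ValueError(f"permission must be one of {sorted(ALLOWED_PERMISSIONS)}")


def needs_reprocessing(old: KnowledgeBaseConfig, new: KnowledgeBaseConfig) -> bool:
    """Vectors from different embedding models are not comparable, so a
    model change means every document has to be embedded again."""
    return old.embedding_model != new.embedding_model
```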
Step 2: Upload Documents
Upload one or more documents to the knowledge base. RAGFlow supports a wide range of file formats, including PDF, Word (DOCX), Excel (XLSX), PowerPoint (PPTX), plain text (TXT), CSV, images (PNG, JPG), Markdown, HTML, and more. Documents are stored in the object storage backend (MinIO/S3) and registered in the database with metadata tracking.
Key considerations:
- File size limits and per-user file count limits may apply depending on configuration
- Documents can also be linked from the File Manager or synced from external data sources (S3, Confluence, Notion, Google Drive)
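A pre-upload check along these lines can catch unsupported files and oversized uploads before they reach the server. It is a minimal sketch: the extension set only covers the formats named in this step (the real list is longer), and the function name and the `max_bytes` parameter are hypothetical, since actual limits depend on the deployment's configuration.

```python
from pathlib import Path

# Extensions taken from the formats named in this step; the real list is
# longer ("and more"), so treat this set as illustrative.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx", ".pptx", ".txt", ".csv",
    ".png", ".jpg", ".md", ".html",
}


def check_upload(filename: str, size_bytes: int, max_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = Path(filename).suffix.lower()  # case-insensitive extension match
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > max_bytes:
        problems.append(f"file is {size_bytes} bytes, limit is {max_bytes}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with a batch of files in a single pass.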
Step 3: Configure Chunking Method
Select the appropriate chunking method (parser) for each document or set a default for the knowledge base. RAGFlow provides multiple template-based chunking strategies optimized for different document types:
Available methods:
- Naive (General): Configurable delimiter-based splitting with layout recognition
- Book: Optimized for long-form book content with TOC enhancement
- Paper: Specialized for academic papers with section-aware parsing
- Laws: Legal document parsing with clause structure preservation
- Presentation: Slide deck parsing
- QA: Extracts question-answer pairs directly
- Table: Structured data extraction
- Knowledge Graph: Entity and relationship extraction for graph-based retrieval
- Tag: Tag-based categorization chunking
- One: Entire document as a single chunk
Key considerations:
- Each method has specific configuration options (max token count, layout recognition, OCR settings)
- The naive method is the most flexible and suitable for general documents
- Knowledge graph method requires an LLM for entity extraction
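One way to apply the list above is a simple lookup from a coarse document kind to a chunking method, falling back to naive for anything unrecognized. The document kinds, the lowercase method identifiers, and the `short_document` shortcut are all assumptions made for this sketch; RAGFlow's actual identifiers and defaults may differ.

```python
# Hypothetical mapping from a coarse document kind to a chunking method.
# The method labels follow the list above; the kinds and the identifiers
# are assumptions made for this sketch.
METHOD_BY_KIND = {
    "book": "book",
    "academic_paper": "paper",
    "legal": "laws",
    "slides": "presentation",
    "faq": "qa",
    "spreadsheet": "table",
}


def pick_chunking_method(kind: str, short_document: bool = False) -> str:
    if short_document:
        return "one"  # keep the entire document as a single chunk
    return METHOD_BY_KIND.get(kind, "naive")  # naive is the general fallback
```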
Step 4: Configure Parser Options
Fine-tune the parser configuration for the selected chunking method. Common options include maximum token count per chunk, layout recognition model selection, OCR settings (PaddleOCR or MinerU), delimiter customization, RAPTOR hierarchical summarization, auto-keyword generation, and cross-language query support.
Key considerations:
- Max token count controls chunk granularity (larger chunks provide more context, smaller chunks improve precision)
- Layout recognition helps extract structured content from PDFs and images
- RAPTOR creates hierarchical summaries for improved retrieval on broad queries
- Auto-keywords and auto-questions can enrich chunks for better search recall
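The interaction between delimiters and the maximum token count can be illustrated with a small splitter. This is a simplified sketch, not RAGFlow's parser: the function name is hypothetical, and tokens are approximated as whitespace-separated words, where a real pipeline would use the tokenizer tied to the embedding model.

```python
import re


def chunk_text(text: str, delimiters: str = "\n.!?", max_tokens: int = 128) -> list[str]:
    """Split on delimiter characters, then pack pieces into chunks of at
    most max_tokens tokens. Tokens are whitespace-separated words here, a
    rough stand-in for a real tokenizer. A single piece longer than
    max_tokens is kept whole as its own oversized chunk."""
    pattern = "[" + re.escape(delimiters) + "]+"
    pieces = [p.strip() for p in re.split(pattern, text) if p.strip()]

    chunks, current, count = [], [], 0
    for piece in pieces:
        n = len(piece.split())
        # Start a new chunk when adding this piece would exceed the budget.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(piece)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Running the same text with a smaller `max_tokens` yields more, shorter chunks, which is the granularity trade-off described above.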
Step 5: Trigger Document Processing
Start the document processing pipeline by triggering the "run" action on uploaded documents. This queues tasks for the background task executor workers, which will parse, chunk, embed, and index the documents. Progress can be monitored through the UI with real-time status updates and processing logs.
Key considerations:
- Processing is asynchronous and handled by background task executor workers
- Multiple documents can be processed concurrently depending on worker configuration
- Failed tasks are retried up to 3 times before being marked as abandoned
- Processing progress and logs are tracked in Redis and displayed in the UI
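The retry behavior described above can be modeled with a plain in-memory queue. This is a toy sketch, not RAGFlow's task executor: the function names are hypothetical, the queue stands in for the real task queue, and `process` is a caller-supplied placeholder for the parse, chunk, embed, and index work.

```python
from collections import deque

MAX_RETRIES = 3  # retry limit stated above


def run_tasks(doc_ids, process):
    """Process queued documents; re-queue failures until the retry limit.

    `process` is a caller-supplied function standing in for the
    parse/chunk/embed/index work. Returns a status per document."""
    queue = deque((doc_id, 0) for doc_id in doc_ids)
    status = {}
    while queue:
        doc_id, retries = queue.popleft()
        try:
            process(doc_id)
            status[doc_id] = "done"
        except Exception:
            if retries < MAX_RETRIES:
                # Put the task back at the end of the queue for another try.
                queue.append((doc_id, retries + 1))
            else:
                status[doc_id] = "abandoned"
    return status
```

In this model a document that always fails is attempted four times in total: the initial attempt plus three retries.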
Step 6: Verify and Test Retrieval
After processing completes, verify the results using the built-in retrieval testing feature. Enter test queries to check that relevant chunks are being retrieved with appropriate similarity scores. Review chunk content and adjust parser settings if needed. The testing interface allows configuring similarity threshold, Top-N results, reranking model, and keyword vs. semantic search weights.
Key considerations:
- Use the retrieval testing tab within the knowledge base to validate chunk quality
- Adjust similarity threshold and Top-N parameters to optimize recall and precision
- Consider enabling a reranking model for improved result ordering
- If results are poor, revisit chunking method and parser configuration
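To see how the testing parameters interact, the following sketch blends a keyword score and a vector score with a weight, filters by a similarity threshold, and keeps the Top-N results. The function name, the linear blending formula, and the default values are assumptions for illustration; RAGFlow's actual scoring may be computed differently.

```python
def rank_chunks(chunks, vector_weight=0.7, similarity_threshold=0.2, top_n=3):
    """chunks: list of (chunk_id, keyword_score, vector_score), scores in [0, 1].

    Blend the two scores, drop anything under the threshold, and return
    the top_n best as (chunk_id, blended_score) pairs."""
    scored = []
    for chunk_id, keyword_score, vector_score in chunks:
        blended = vector_weight * vector_score + (1 - vector_weight) * keyword_score
        if blended >= similarity_threshold:
            scored.append((chunk_id, blended))
    # Highest blended score first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]
```

Raising `vector_weight` favors semantically similar chunks, lowering it favors exact keyword matches, and a higher `similarity_threshold` trades recall for precision.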