Workflow:Langgenius Dify Knowledge Base Creation
| Knowledge Sources | |
|---|---|
| Domains | RAG, Knowledge_Management, LLMs |
| Last Updated | 2026-02-08 14:00 GMT |
Overview
End-to-end process for creating a Knowledge Base (RAG pipeline) in Dify, from data source ingestion through chunking, embedding, and retrieval configuration.
Description
This workflow covers the complete Knowledge Base creation lifecycle in Dify. Users select a data source (file upload, Notion sync, or web scraping), configure document chunking strategies (general, Q&A, parent-child, or graph modes), choose an indexing method (high-quality embedding or economy keyword-based), and set up retrieval parameters. The result is a searchable knowledge base that can be connected to chatbots, agents, and workflows for Retrieval-Augmented Generation.
Usage
Execute this workflow when you have domain-specific documents (PDFs, text files, web pages, or Notion content) that you want to make available to LLM applications as a knowledge source. High-quality indexing requires at least one configured embedding model provider; without one, you can still use economy mode with keyword-based retrieval.
Execution Steps
Step 1: Data Source Selection
Choose the source of documents for the knowledge base. Dify supports three ingestion methods, one per content origin. File upload accepts drag-and-drop of common document formats. Notion sync connects to a Notion workspace via API integration. Web scraping uses configurable crawler providers to extract content from URLs. You can also create an empty knowledge base and add documents later.
Supported data sources:
- File Upload — Drag and drop or browse for PDFs, text files, DOCX, and other supported formats
- Notion Sync — Connect to a Notion workspace and select pages to import
- Website Crawl — Crawl web pages using Firecrawl, Jina Reader, or Watercrawl providers
- Empty Knowledge — Create the knowledge base first and add documents later
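The choice above can be modelled as a small validated record. This is only an illustrative sketch: the enum values, the provider identifiers, and the `SourceSelection` class are hypothetical names chosen for this example, not Dify's own identifiers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DataSource(Enum):
    # Hypothetical identifiers for the four options listed above.
    FILE_UPLOAD = "file_upload"
    NOTION_SYNC = "notion_sync"
    WEBSITE_CRAWL = "website_crawl"
    EMPTY = "empty"


# Provider names from the list above, written as assumed lowercase keys.
CRAWL_PROVIDERS = {"firecrawl", "jina_reader", "watercrawl"}


@dataclass
class SourceSelection:
    kind: DataSource
    crawl_provider: Optional[str] = None

    def validate(self) -> None:
        """Raise ValueError if the selection is inconsistent."""
        if self.kind is DataSource.WEBSITE_CRAWL:
            if self.crawl_provider not in CRAWL_PROVIDERS:
                raise ValueError("website crawl needs a known crawl provider")
        elif self.crawl_provider is not None:
            raise ValueError("crawl_provider only applies to website crawl")


SourceSelection(DataSource.WEBSITE_CRAWL, "firecrawl").validate()  # passes
```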
Step 2: Document Processing Configuration
Configure how ingested documents are split into retrievable chunks. Select a chunking mode that matches the content structure, set chunk size and overlap parameters, and define text preprocessing rules.
Chunking modes:
- General (text_model) — Standard paragraph-based splitting for most document types
- Q&A (qa_model) — Pair-based format for FAQ-style content (requires high-quality indexing)
- Parent-Child (hierarchical_model) — Multi-level hierarchical chunks with parent context preservation
- Graph (graph_model) — Reserved for future Graph RAG capabilities
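To make the parent-child idea concrete, here is a minimal sketch that treats paragraphs as parents and sentences as children. The splitting rules (blank lines and sentence punctuation) and the `ParentChunk` type are assumptions made for illustration; Dify's actual delimiters and length limits are configurable and are not reproduced here.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParentChunk:
    text: str                                  # larger unit kept for context
    children: List[str] = field(default_factory=list)  # smaller units to match on


def parent_child_split(document: str) -> List[ParentChunk]:
    """Split on blank lines into parents, then on sentence ends into children."""
    parents = []
    for paragraph in re.split(r"\n\s*\n", document.strip()):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        sentences = [
            s.strip()
            for s in re.split(r"(?<=[.!?])\s+", paragraph)
            if s.strip()
        ]
        parents.append(ParentChunk(text=paragraph, children=sentences))
    return parents


chunks = parent_child_split("Alpha one. Alpha two.\n\nBeta one.")
for parent in chunks:
    print(parent.text, "->", parent.children)
```

A retriever built on this structure would typically match a query against the short children and hand the enclosing parent to the LLM, which is one way to read "parent context preservation".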
Preprocessing options:
- Replace consecutive whitespace characters
- Remove stopwords
- Delete URLs and email addresses
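The two text-cleaning options and the chunk size and overlap parameters can be sketched as follows. This is a simplified illustration, not Dify's implementation: lengths are counted in characters for simplicity, the regular expressions are deliberately rough, and the function names are hypothetical.

```python
import re
from typing import List

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def preprocess(text: str, collapse_whitespace: bool = True,
               remove_urls_emails: bool = True) -> str:
    """Apply the cleaning rules before chunking."""
    if remove_urls_emails:
        text = URL_RE.sub("", text)
        text = EMAIL_RE.sub("", text)
    if collapse_whitespace:
        # Runs of spaces, tabs, and newlines become a single space.
        text = re.sub(r"\s+", " ", text).strip()
    return text


def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Slide a fixed-size window; consecutive chunks share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


clean = preprocess("See  https://example.com\n\nor mail a@b.com  now")
print(clean)
print(chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2))
```

Overlap trades index size for recall: a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.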
Step 3: Indexing Method Selection
Choose the indexing strategy that balances accuracy, cost, and resource requirements. High-quality mode uses embedding models to create vector representations of each chunk. Economy mode extracts keywords without consuming embedding tokens.
Indexing methods:
- High Quality — Vector embeddings via configured model provider (higher accuracy, consumes tokens)
- Economy — Keyword extraction with 10 keywords per chunk (no token cost, lower accuracy)
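To give a feel for what economy mode stores per chunk, the sketch below keeps up to 10 keywords per chunk by simple frequency counting. The counting approach, the tiny stopword list, and the `extract_keywords` name are stand-ins chosen for this example; the extractor Dify actually uses is not described here and is likely more sophisticated.

```python
import re
from collections import Counter
from typing import List

# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was", "from"}


def extract_keywords(chunk: str, max_keywords: int = 10) -> List[str]:
    """Return the most frequent non-stopword terms in a chunk."""
    words = re.findall(r"[a-z0-9]+", chunk.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(max_keywords)]


print(extract_keywords(
    "Dify indexes documents. Dify retrieves documents and chunks."
))
```

No embedding model is called at any point, which is why this style of indexing consumes no embedding tokens; the trade-off is that a query must share literal terms with a chunk to find it.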
Step 4: Embedding and Indexing Execution
The system processes all documents through the configured pipeline. Documents transition through states: queuing, indexing, and completed (or error). Users can pause, resume, or cancel processing. Token and chunk count estimates are provided before execution begins.
What happens:
- Documents are queued for processing
- Each document is split into chunks per the configured strategy
- Chunks are embedded (high-quality) or keyword-indexed (economy)
- Processing status is tracked per document with progress indicators
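The per-document lifecycle can be pictured as a small state machine. The sketch below uses the states named above plus a `PAUSED` state, which is an assumption introduced here to model pause and resume; the transition table and class names are hypothetical and do not mirror Dify's internal status values.

```python
from enum import Enum


class Status(Enum):
    QUEUING = "queuing"
    INDEXING = "indexing"
    PAUSED = "paused"        # assumed state, added to model pause/resume
    COMPLETED = "completed"
    ERROR = "error"


# Which states each state may move to; terminal states allow none.
ALLOWED = {
    Status.QUEUING: {Status.INDEXING, Status.ERROR},
    Status.INDEXING: {Status.PAUSED, Status.COMPLETED, Status.ERROR},
    Status.PAUSED: {Status.INDEXING, Status.ERROR},
    Status.COMPLETED: set(),
    Status.ERROR: set(),
}


class DocumentJob:
    """Tracks the processing status of one document."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.status = Status.QUEUING

    def transition(self, new_status: Status) -> None:
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"{self.name}: cannot go from {self.status.value} "
                f"to {new_status.value}"
            )
        self.status = new_status


job = DocumentJob("handbook.pdf")
job.transition(Status.INDEXING)
job.transition(Status.PAUSED)     # user pauses
job.transition(Status.INDEXING)   # user resumes
job.transition(Status.COMPLETED)
print(job.status.value)
```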
Step 5: Retrieval Configuration
Configure how the knowledge base responds to queries. Select the retrieval method, set the number of results to return (top_k), define a similarity score threshold, and optionally enable a reranking model for improved result quality.
Retrieval methods:
- Semantic Search — Vector similarity matching
- Full-Text Search — Text-based matching
- Hybrid Search — Combines semantic and full-text (recommended)
- Keyword Search — Inverted index matching
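Inverted index matching, the basis of keyword search, can be illustrated in a few lines. This is a generic sketch of the technique under assumed inputs: the chunk ids, keywords, and function names are made up for the example and say nothing about how Dify stores its index.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_inverted_index(chunk_keywords: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each keyword to the set of chunk ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for chunk_id, keywords in chunk_keywords.items():
        for keyword in keywords:
            index[keyword].add(chunk_id)
    return dict(index)


def keyword_search(index: Dict[str, Set[str]],
                   query_terms: Iterable[str]) -> List[Tuple[str, int]]:
    """Rank chunks by how many query terms they match (ties broken by id)."""
    hits: Counter = Counter()
    for term in query_terms:
        for chunk_id in index.get(term, ()):
            hits[chunk_id] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


index = build_inverted_index({
    "chunk-1": ["dify", "rag"],
    "chunk-2": ["rag", "embedding"],
})
print(keyword_search(index, ["dify", "rag"]))
```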
Tunable parameters:
- Top K (number of chunks to retrieve)
- Score threshold (minimum similarity cutoff)
- Rerank model selection and configuration
- Weighted score balance between semantic and keyword results
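The interaction between the weighted score balance, the score threshold, and Top K can be sketched as below. It assumes each candidate already carries a semantic score and a keyword score normalised to the range 0 to 1, and it combines them with a plain weighted sum; the function names and the combination formula are illustrative assumptions, not a description of Dify's scoring.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, one common way to produce a semantic score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_retrieve(
    candidates: List[Tuple[str, float, float]],  # (chunk_id, semantic, keyword)
    semantic_weight: float = 0.7,
    top_k: int = 3,
    score_threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Blend the two scores, drop weak results, return the best top_k."""
    keyword_weight = 1.0 - semantic_weight
    scored = [
        (chunk_id, semantic_weight * sem + keyword_weight * kw)
        for chunk_id, sem, kw in candidates
    ]
    kept = [pair for pair in scored if pair[1] >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


candidates = [("a", 0.9, 0.1), ("b", 0.5, 0.9), ("c", 0.2, 0.2)]
for chunk_id, score in hybrid_retrieve(
    candidates, semantic_weight=0.5, top_k=2, score_threshold=0.3
):
    print(chunk_id, round(score, 2))
```

The threshold is applied before the Top K cut, so a strict threshold can return fewer than `top_k` chunks. A reranking model, when enabled, would reorder the surviving candidates; that step is omitted here.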
Step 6: Knowledge Base Verification
Test the knowledge base with sample queries using the built-in hit testing interface. Review retrieved chunks, their relevance scores, and source attribution. Iterate on chunking and retrieval settings if results are unsatisfactory.
Verification capabilities:
- Query input with real-time retrieval results
- Chunk content preview with relevance scores
- Retrieval history for comparing configuration changes
- Source document attribution
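Hit testing in the interface is manual, but the same idea can be scripted when comparing configurations. The sketch below computes a hit rate: the share of sample queries for which the expected source document appears among the top results. The `retrieve` callable, the result dictionary shape, and the sample data are all hypothetical; in practice `retrieve` would wrap whatever retrieval call your deployment exposes.

```python
from typing import Callable, Dict, List, Tuple

Result = Dict[str, object]  # assumed shape: {"source": str, "score": float}


def hit_rate(
    test_cases: List[Tuple[str, str]],          # (query, expected source)
    retrieve: Callable[[str], List[Result]],
    top_k: int = 3,
) -> float:
    """Fraction of queries whose expected source is in the top_k results."""
    if not test_cases:
        return 0.0
    hits = 0
    for query, expected_source in test_cases:
        results = retrieve(query)[:top_k]
        if any(r["source"] == expected_source for r in results):
            hits += 1
    return hits / len(test_cases)


def fake_retrieve(query: str) -> List[Result]:
    # Stand-in retriever with canned answers, for demonstration only.
    canned = {
        "refund policy": [{"source": "policies.pdf", "score": 0.82}],
        "api limits": [{"source": "faq.md", "score": 0.40}],
    }
    return canned.get(query, [])


cases = [("refund policy", "policies.pdf"), ("api limits", "api-guide.pdf")]
print(hit_rate(cases, fake_retrieve))
```

Re-running the same test cases after each change to chunking or retrieval settings gives a simple, repeatable number to compare alongside the retrieval history.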