Workflow:Langgenius Dify Knowledge Base Management
| Knowledge Sources | |
|---|---|
| Domains | RAG, Knowledge_Management, Data_Engineering |
| Last Updated | 2026-02-12 07:00 GMT |
Overview
End-to-end process for creating and managing knowledge bases (datasets) in Dify, from document upload through text chunking and embedding to retrieval quality testing.
Description
This workflow covers the complete lifecycle of knowledge base management in Dify's Retrieval-Augmented Generation (RAG) system. Users create datasets, upload documents from various sources (files, web pages, Notion), configure chunking strategies, select embedding models, monitor indexing progress, manage document segments, and test retrieval quality. The system supports multiple chunking modes (automatic, custom, parent-child, QA), configurable embedding providers, and hybrid retrieval strategies that combine keyword and semantic search with optional reranking.
Usage
Execute this workflow when your LLM applications need access to domain-specific knowledge. It builds the knowledge layer that powers retrieval-augmented generation in chatbots, agents, and workflows, and applies whenever you have documents, web content, or structured data that your AI applications should be able to search.
Execution Steps
Step 1: Create Dataset
Create a new dataset (knowledge base) with a name and optional description. Choose between creating an empty dataset for later population or uploading documents immediately during creation. Configure the indexing method and embedding model that will be used for all documents in this dataset.
Dataset configuration:
- Name and description for identification
- Indexing technique: high-quality (vector + keyword) or economical (keyword only)
- Embedding model selection from configured providers
- Permission settings: only-me or all-team-members
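The configuration options above can be captured as a small, validated request body. The following is a minimal sketch, not Dify's actual API: the helper `build_dataset_payload` is hypothetical, and the field names and values (`indexing_technique`, `permission`, `high_quality`, `economy`, `only_me`, `all_team_members`, `embedding_model`) are assumptions modeled on the options listed here. Check them against the API reference for your Dify version before use.

```python
from typing import Optional

# Assumed identifiers for the options described above; verify for your Dify version.
INDEXING_TECHNIQUES = {"high_quality", "economy"}
PERMISSIONS = {"only_me", "all_team_members"}


def build_dataset_payload(
    name: str,
    description: str = "",
    indexing_technique: str = "high_quality",
    permission: str = "only_me",
    embedding_model: Optional[str] = None,
) -> dict:
    """Validate dataset settings and return a JSON-serializable request body."""
    if not name.strip():
        raise ValueError("dataset name must not be empty")
    if indexing_technique not in INDEXING_TECHNIQUES:
        raise ValueError(f"unknown indexing technique: {indexing_technique}")
    if permission not in PERMISSIONS:
        raise ValueError(f"unknown permission: {permission}")
    # Economical indexing is keyword only, so an embedding model makes no sense there.
    if indexing_technique == "economy" and embedding_model is not None:
        raise ValueError("economical indexing does not use an embedding model")

    payload = {
        "name": name.strip(),
        "description": description,
        "indexing_technique": indexing_technique,
        "permission": permission,
    }
    if embedding_model is not None:
        payload["embedding_model"] = embedding_model
    return payload
```

Validating before sending keeps configuration mistakes (for example, pairing keyword-only indexing with an embedding model) out of the asynchronous indexing pipeline, where they are harder to diagnose.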
Step 2: Upload and Add Documents
Add documents to the dataset from supported data sources. The platform accepts file uploads (PDF, TXT, Markdown, DOCX, HTML, CSV, and more), web page URLs for crawling, and Notion page imports. Each document is queued for asynchronous processing.
Data sources and upload options:
- File upload: Direct upload of supported document formats
- Web crawling: Import content from URLs
- Notion sync: Import pages from connected Notion workspaces
- Batch upload of multiple files simultaneously
- Metadata assignment for document-level attributes
Step 3: Configure Chunking Strategy
Select and configure the text chunking strategy that determines how documents are split into retrievable segments. Different strategies suit different content types and retrieval patterns.
Chunking modes:
- Automatic: Platform applies its preset chunk size, overlap, and pre-processing rules
- Custom: Manually specify chunk size, overlap, and separator rules
- Parent-child (hierarchical): Create nested chunks where parent chunks provide context and child chunks provide precision
- QA mode: Generate question-answer pairs from document content for structured retrieval
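To illustrate the parent-child idea from the list above, here is a minimal sketch in which matching happens on small child chunks while the larger parent chunk is what gets returned as context. It is a simplified stand-in, not Dify's implementation: the `ParentChunk` class, both functions, and the choice of paragraphs as parents and sentences as children are assumptions made for the example, and plain substring matching replaces real vector search.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParentChunk:
    text: str
    children: List[str] = field(default_factory=list)


def parent_child_chunks(
    document: str, parent_sep: str = "\n\n", child_sep: str = ". "
) -> List[ParentChunk]:
    """Split a document into parents (paragraphs), each holding children (sentences)."""
    parents = []
    for block in document.split(parent_sep):
        block = block.strip()
        if not block:
            continue
        children = [s.strip() for s in block.split(child_sep) if s.strip()]
        parents.append(ParentChunk(text=block, children=children))
    return parents


def retrieve_parent(parents: List[ParentChunk], query_term: str) -> Optional[str]:
    """Match against child chunks for precision, return the parent for context."""
    term = query_term.lower()
    for parent in parents:
        if any(term in child.lower() for child in parent.children):
            return parent.text
    return None
```

The design point is the asymmetry: the unit that is searched (child) is smaller than the unit handed to the LLM (parent).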
Parameters:
- Maximum chunk size (tokens)
- Chunk overlap for context continuity
- Separator rules for splitting boundaries
- Pre-processing rules (whitespace, URL handling)
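The interaction between these parameters can be sketched as a sliding window. The example below is an illustration under stated assumptions rather than the platform's splitter: `preprocess` and `chunk_text` are hypothetical helpers, and whitespace-separated words stand in for model tokens, which a real tokenizer would count differently.

```python
import re
from typing import List


def preprocess(text: str, strip_urls: bool = False) -> str:
    """Optionally remove URLs, then collapse runs of spaces and tabs."""
    if strip_urls:
        text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"[ \t]+", " ", text).strip()


def chunk_text(
    text: str, max_tokens: int = 50, overlap: int = 10, separator: str = "\n\n"
) -> List[str]:
    """Split on the separator first, then window each block with overlap.

    Words are used as a rough proxy for tokens (an assumption of this sketch).
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    step = max_tokens - overlap
    chunks = []
    for block in text.split(separator):
        tokens = block.split()
        if not tokens:
            continue
        start = 0
        while True:
            chunks.append(" ".join(tokens[start:start + max_tokens]))
            if start + max_tokens >= len(tokens):
                break
            start += step  # the next window re-reads the last `overlap` tokens
    return chunks
```

For example, `chunk_text("a b c d e f g h i j", max_tokens=4, overlap=1)` returns `["a b c d", "d e f g", "g h i j"]`: each chunk repeats the final token of the previous one, which is the context continuity that overlap provides.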
Step 4: Monitor Indexing Progress
Track the asynchronous document processing pipeline as it progresses through parsing, chunking, embedding, and indexing stages. The platform processes documents via Celery task queues with priority support for urgent updates.
Processing stages:
- Document parsing and text extraction
- Text chunking according to configured strategy
- Embedding generation via the selected model
- Vector index insertion into the configured vector database
- Keyword index building for hybrid retrieval
Monitoring:
- Per-document indexing status (queued, processing, completed, failed, paused)
- Word count and segment count after processing
- Ability to pause and resume indexing
- Re-indexing after configuration changes
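A client that polls per-document status can reduce the results to a simple progress summary. The sketch below assumes the five status names listed above and a hypothetical `summarize_indexing` helper; how the statuses are fetched (UI, API, or database) is left out, and the actual status identifiers in your deployment may differ.

```python
from collections import Counter
from typing import Dict

# Status names taken from the monitoring list above (assumed spelling).
STATUSES = {"queued", "processing", "completed", "failed", "paused"}


def summarize_indexing(statuses: Dict[str, str]) -> dict:
    """Summarize a mapping of document name -> indexing status."""
    unknown = set(statuses.values()) - STATUSES
    if unknown:
        raise ValueError(f"unknown statuses: {sorted(unknown)}")

    counts = Counter(statuses.values())
    total = len(statuses)
    finished = counts["completed"] + counts["failed"]  # terminal states
    return {
        "counts": dict(counts),
        "done": total > 0 and finished == total,
        "progress": finished / total if total else 0.0,
        # Failed and paused documents will not finish without intervention.
        "needs_attention": sorted(
            doc for doc, status in statuses.items() if status in {"failed", "paused"}
        ),
    }
```

Treating `failed` as terminal but flagging it separately lets a polling loop stop waiting while still surfacing the documents that need to be retried or resumed.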
Step 5: Manage Segments and Metadata
Review and manage individual text segments after indexing. Edit segment content, add or modify metadata, enable or disable specific segments, and manage document-level metadata fields. Segments can be searched, filtered, and individually toggled for retrieval.
Segment management:
- Browse segments with keyword search and filtering
- Edit segment content and metadata
- Enable or disable individual segments
- Add custom metadata fields to documents
- Batch operations for bulk segment management
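The segment operations above can be modeled in memory as follows. This is a simplified sketch for reasoning about the behavior, not Dify's data model: the `Segment` class and the helpers `search_segments` and `set_enabled` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Segment:
    segment_id: str
    content: str
    enabled: bool = True
    metadata: Dict[str, str] = field(default_factory=dict)


def search_segments(
    segments: Iterable[Segment], keyword: str, enabled_only: bool = False
) -> List[Segment]:
    """Case-insensitive keyword search, optionally limited to enabled segments."""
    kw = keyword.lower()
    return [
        s for s in segments
        if kw in s.content.lower() and (s.enabled or not enabled_only)
    ]


def set_enabled(
    segments: Iterable[Segment], segment_ids: Iterable[str], enabled: bool
) -> int:
    """Batch-toggle segments by id and return how many actually changed."""
    ids = set(segment_ids)
    changed = 0
    for s in segments:
        if s.segment_id in ids and s.enabled != enabled:
            s.enabled = enabled
            changed += 1
    return changed
```

Disabling a segment, rather than deleting it, keeps the content available for later review while excluding it from retrieval.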
Step 6: Test Retrieval Quality
Use the built-in hit testing feature to evaluate retrieval quality. Submit test queries and examine which segments are retrieved, their relevance scores, and their ranking order. Adjust retrieval parameters based on test results.
Hit testing:
- Submit natural language queries
- View retrieved segments with relevance scores
- Compare results across retrieval methods (keyword, semantic, hybrid)
- Adjust top-K and score threshold parameters
- Validate that important content is retrievable
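The effect of top-K and the score threshold can be explored with a toy hit test. The sketch below combines a semantic score (cosine similarity) with a keyword score (query-term overlap) through a weighted sum; this weighting scheme is an illustrative assumption, not necessarily how Dify scores or reranks results, and the two-dimensional vectors stand in for real embeddings. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of distinct query terms that appear in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def hit_test(
    query: str,
    query_vec: Sequence[float],
    segments: List[Tuple[str, Sequence[float]]],
    semantic_weight: float = 0.7,
    top_k: int = 3,
    score_threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Rank (text, vector) segments by a weighted hybrid score."""
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    results = []
    for text, vec in segments:
        score = (
            semantic_weight * cosine(query_vec, vec)
            + (1.0 - semantic_weight) * keyword_score(query, text)
        )
        if score >= score_threshold:  # drop weak matches before ranking
            results.append((text, round(score, 4)))
    results.sort(key=lambda item: item[1], reverse=True)
    return results[:top_k]
```

Setting `semantic_weight` to 1.0 or 0.0 reduces this to pure semantic or pure keyword ranking, which makes it easy to compare the three retrieval methods on the same test queries.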