Workflow:Langgenius Dify Knowledge Base Management

Knowledge Sources	Dify, Dify Docs
Domains	RAG, Knowledge_Management, Data_Engineering
Last Updated	2026-02-12 07:00 GMT

Overview

End-to-end process for creating and managing knowledge bases (datasets) in Dify, from document upload through text chunking and embedding to retrieval quality testing.

Description

This workflow covers the complete lifecycle of knowledge base management in Dify's Retrieval-Augmented Generation (RAG) system. Users create datasets, upload documents from various sources (files, web pages, Notion), configure chunking strategies, select embedding models, monitor indexing progress, manage document segments, and test retrieval quality. The system supports multiple chunking modes (automatic, custom, parent-child), configurable embedding providers, and hybrid retrieval strategies combining keyword and semantic search with optional reranking.

Usage

Use this workflow to give your LLM applications access to domain-specific knowledge. It builds the knowledge layer that powers retrieval-augmented generation in chatbots, agents, and workflows, and it applies whenever you have documents, web content, or structured data that your AI applications should be able to search.

Execution Steps

Step 1: Create Dataset

Create a new dataset (knowledge base) with a name and optional description. Choose between creating an empty dataset for later population or uploading documents immediately during creation. Configure the indexing method and embedding model that will be used for all documents in this dataset.

Dataset configuration:

Name and description for identification
Indexing technique: high-quality (vector + keyword) or economical (keyword only)
Embedding model selection from configured providers
Permission settings: only-me or all-team-members
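
The settings above can be collected and validated before they are sent to the platform. The following is a minimal sketch, not Dify's client code: the `DatasetConfig` class, the payload key names, and the underscore spellings of the option values are assumptions made for illustration and should be checked against the Dify API reference.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed spellings of the options listed above.
INDEXING_TECHNIQUES = {"high_quality", "economy"}
PERMISSIONS = {"only_me", "all_team_members"}


@dataclass
class DatasetConfig:
    """Hypothetical container for the dataset settings from Step 1."""
    name: str
    description: str = ""
    indexing_technique: str = "high_quality"
    permission: str = "only_me"
    embedding_model: Optional[str] = None

    def validate(self) -> None:
        if not self.name.strip():
            raise ValueError("dataset name must not be empty")
        if self.indexing_technique not in INDEXING_TECHNIQUES:
            raise ValueError(f"unknown indexing technique: {self.indexing_technique}")
        if self.permission not in PERMISSIONS:
            raise ValueError(f"unknown permission: {self.permission}")

    def to_payload(self) -> dict:
        """Return a JSON-serializable dict; the key names are illustrative."""
        self.validate()
        payload = {
            "name": self.name,
            "description": self.description,
            "indexing_technique": self.indexing_technique,
            "permission": self.permission,
        }
        # Keyword-only (economical) indexing has no use for an embedding model.
        if self.indexing_technique == "high_quality" and self.embedding_model:
            payload["embedding_model"] = self.embedding_model
        return payload


# "example-embedding-model" is a placeholder, not a real model name.
config = DatasetConfig(name="Product manuals", embedding_model="example-embedding-model")
payload = config.to_payload()
```

Validating locally catches a misspelled option before any request is made; the platform remains the authority on which values it accepts.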

Step 2: Upload and Add Documents

Add documents to the dataset from supported data sources. The platform accepts file uploads (PDF, TXT, Markdown, DOCX, HTML, CSV, and more), web page URLs for crawling, and Notion page imports. Each document is queued for asynchronous processing.

Data sources and upload options:

File upload: Direct upload of supported document formats
Web crawling: Import content from URLs
Notion sync: Import pages from connected Notion workspaces
Batch upload of multiple files simultaneously
Metadata assignment for document-level attributes
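
Before a batch upload it can help to separate files the platform is likely to accept from those it will reject. The sketch below assumes a fixed extension list taken from the formats named above (the platform accepts more) and a hypothetical batch size; neither reflects an actual Dify limit.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Extensions for the formats named in this step; not an exhaustive list.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".markdown", ".docx", ".html", ".csv"}


def split_supported(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition paths into (accepted, rejected) by file extension."""
    accepted, rejected = [], []
    for path in paths:
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            accepted.append(path)
        else:
            rejected.append(path)
    return accepted, rejected


def batches(items: List[str], size: int) -> List[List[str]]:
    """Group items into consecutive batches of at most `size` entries."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


files = ["guide.PDF", "notes.md", "archive.zip", "faq.docx", "data.csv"]
accepted, rejected = split_supported(files)
upload_batches = batches(accepted, size=3)  # size is an arbitrary example value
```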

Step 3: Configure Chunking Strategy

Select and configure the text chunking strategy that determines how documents are split into retrievable segments. Different strategies suit different content types and retrieval patterns.

Chunking modes:

Automatic: Platform applies its default chunk size, overlap, and pre-processing rules
Custom: Manually specify chunk size, overlap, and separator rules
Parent-child (hierarchical): Create nested chunks where parent chunks provide context and child chunks provide precision
QA mode: Generate question-answer pairs from document content for structured retrieval
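
The parent-child idea can be pictured with a small sketch: matching happens on small child chunks, and the larger parent is what gets returned as context. The delimiters used here (blank lines for parents, sentence boundaries for children) and the substring match standing in for retrieval are assumptions for illustration, not Dify's implementation.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParentChunk:
    text: str
    children: List[str] = field(default_factory=list)


def build_parent_child(document: str) -> List[ParentChunk]:
    """Paragraphs become parents; the sentences inside them become children."""
    parents = []
    for paragraph in re.split(r"\n\s*\n", document.strip()):
        paragraph = " ".join(paragraph.split())  # normalize whitespace
        if not paragraph:
            continue
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        parents.append(ParentChunk(text=paragraph, children=sentences))
    return parents


def retrieve_parent(parents: List[ParentChunk], query_term: str) -> Optional[str]:
    """Match on the small child chunks, return the larger parent for context."""
    term = query_term.lower()
    for parent in parents:
        if any(term in child.lower() for child in parent.children):
            return parent.text
    return None


doc = (
    "Dify stores datasets. Each dataset holds documents.\n\n"
    "Chunks are embedded. Retrieval returns segments."
)
parents = build_parent_child(doc)
context = retrieve_parent(parents, "embedded")
```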

Parameters:

Maximum chunk size (tokens)
Chunk overlap for context continuity
Separator rules for splitting boundaries
Pre-processing rules (whitespace, URL handling)
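
How these parameters interact is easiest to see in a small custom chunker. This sketch makes two simplifying assumptions: a "token" is a whitespace-separated word (real tokenizers count differently), and the separator is applied before the size limit. The function names are hypothetical.

```python
import re
from typing import List


def preprocess(text: str, remove_urls: bool = False) -> str:
    """Optionally drop URLs, then collapse runs of spaces and tabs."""
    if remove_urls:
        text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"[ \t]+", " ", text).strip()


def chunk_text(text: str, max_tokens: int, overlap: int, separator: str = "\n\n") -> List[str]:
    """Split on `separator`, then window each block with the given overlap."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for block in text.split(separator):
        tokens = block.split()
        if not tokens:
            continue
        for start in range(0, len(tokens), step):
            chunks.append(" ".join(tokens[start:start + max_tokens]))
            if start + max_tokens >= len(tokens):
                break  # the window already reached the end of this block
    return chunks


text = "one two three four five six seven\n\neight nine"
chunks = chunk_text(text, max_tokens=4, overlap=1)
cleaned = preprocess("See  https://example.com   now", remove_urls=True)
```

With an overlap of 1, the last token of one chunk is repeated as the first token of the next, which is what preserves context across a chunk boundary.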

Step 4: Monitor Indexing Progress

Track the asynchronous document processing pipeline as it progresses through parsing, chunking, embedding, and indexing stages. The platform processes documents via Celery task queues with priority support for urgent updates.

Processing stages:

Document parsing and text extraction
Text chunking according to configured strategy
Embedding generation via the selected model
Vector index insertion into the configured vector database
Keyword index building for hybrid retrieval

Monitoring:

Per-document indexing status (queued, processing, completed, failed, paused)
Word count and segment count after processing
Ability to pause and resume indexing
Re-indexing for configuration changes
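
Because processing is asynchronous, a client typically polls until every document reaches a final state. The following is a minimal sketch of such a loop. The status fetcher is injected as a callable, so no particular endpoint is assumed; the status names follow the list above, and a real deployment may report others.

```python
import time
from collections import Counter
from typing import Callable, Dict

# States after which a document will not change without user action.
TERMINAL_STATUSES = {"completed", "failed"}


def summarize(statuses: Dict[str, str]) -> Dict[str, int]:
    """Count documents per status."""
    return dict(Counter(statuses.values()))


def wait_for_indexing(
    fetch_statuses: Callable[[], Dict[str, str]],
    interval_seconds: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll until all documents are completed or failed, or give up.

    Paused documents are not terminal here, so the loop keeps waiting
    for them until they are resumed or `max_polls` is exhausted.
    """
    for _ in range(max_polls):
        statuses = fetch_statuses()
        if all(status in TERMINAL_STATUSES for status in statuses.values()):
            return statuses
        sleep(interval_seconds)
    raise TimeoutError("indexing did not finish within the polling budget")


# Simulated status snapshots stand in for real API responses.
snapshots = iter([
    {"doc-1": "queued", "doc-2": "processing"},
    {"doc-1": "processing", "doc-2": "completed"},
    {"doc-1": "completed", "doc-2": "failed"},
])
final = wait_for_indexing(lambda: next(snapshots), sleep=lambda _: None)
counts = summarize(final)
```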

Step 5: Manage Segments and Metadata

Review and manage individual text segments after indexing. Edit segment content, add or modify metadata, enable or disable specific segments, and manage document-level metadata fields. Segments can be searched, filtered, and individually toggled for retrieval.

Segment management:

Browse segments with keyword search and filtering
Edit segment content and metadata
Enable or disable individual segments
Add custom metadata fields to documents
Batch operations for bulk segment management
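
The operations above can be modeled with a small in-memory structure. This is an illustrative sketch only: the `Segment` class and its field names are hypothetical and do not mirror Dify's data model.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Segment:
    """Hypothetical in-memory view of one indexed segment."""
    segment_id: str
    content: str
    enabled: bool = True
    metadata: Dict[str, str] = field(default_factory=dict)


def search_segments(segments: Iterable[Segment], keyword: str,
                    enabled_only: bool = False) -> List[Segment]:
    """Case-insensitive keyword search, optionally skipping disabled segments."""
    needle = keyword.lower()
    return [
        s for s in segments
        if needle in s.content.lower() and (s.enabled or not enabled_only)
    ]


def set_enabled(segments: Iterable[Segment], segment_ids: Iterable[str],
                enabled: bool) -> int:
    """Batch-toggle segments; returns how many actually changed state."""
    ids = set(segment_ids)
    changed = 0
    for s in segments:
        if s.segment_id in ids and s.enabled != enabled:
            s.enabled = enabled
            changed += 1
    return changed


segments = [
    Segment("s1", "Refund policy for annual plans"),
    Segment("s2", "Outdated refund policy (2019)"),
    Segment("s3", "Shipping times by region"),
]
changed = set_enabled(segments, ["s2"], enabled=False)
active_hits = search_segments(segments, "refund", enabled_only=True)
```

Disabling a segment rather than deleting it keeps the content available for review while removing it from retrieval.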

Step 6: Test Retrieval Quality

Use the built-in hit testing feature to evaluate retrieval quality. Submit test queries and examine which segments are retrieved, their relevance scores, and their ranking order. Adjust retrieval parameters based on test results.

Hit testing:

Submit natural language queries
View retrieved segments with relevance scores
Compare results across retrieval methods (keyword, semantic, hybrid)
Adjust top-K and score threshold parameters
Validate that important content is retrievable
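
The effect of top-K, the score threshold, and the retrieval method can be explored offline with a small sketch. It assumes that keyword and semantic scores are already normalized to the range 0 to 1 and combines them with a simple weighted sum; the function names and the weighting scheme are illustrative and are not Dify's scoring or reranking logic.

```python
from typing import Dict, List, Tuple


def hybrid_scores(keyword: Dict[str, float], semantic: Dict[str, float],
                  semantic_weight: float = 0.7) -> Dict[str, float]:
    """Weighted sum of two score maps; a missing score counts as 0."""
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    ids = set(keyword) | set(semantic)
    return {
        i: semantic_weight * semantic.get(i, 0.0)
        + (1.0 - semantic_weight) * keyword.get(i, 0.0)
        for i in ids
    }


def select_hits(scores: Dict[str, float], top_k: int,
                score_threshold: float = 0.0) -> List[Tuple[str, float]]:
    """Drop segments below the threshold, then keep the top_k best."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(i, s) for i, s in ranked if s >= score_threshold][:top_k]


def hit_rate(results: Dict[str, List[str]], expected: Dict[str, str]) -> float:
    """Fraction of test queries whose expected segment was retrieved."""
    if not expected:
        return 0.0
    found = sum(1 for query, seg in expected.items() if seg in results.get(query, []))
    return found / len(expected)


keyword = {"s1": 0.9, "s2": 0.2}
semantic = {"s1": 0.6, "s3": 0.8}
combined = hybrid_scores(keyword, semantic, semantic_weight=0.5)
hits = select_hits(combined, top_k=2, score_threshold=0.3)
hit_ids = [segment_id for segment_id, _ in hits]

rate = hit_rate(
    results={"refund policy": hit_ids, "shipping": ["s2"]},
    expected={"refund policy": "s1", "shipping": "s3"},
)
```

Tracking a hit rate over a fixed set of test queries gives a repeatable way to compare parameter settings before and after a change.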
