Workflow:Langgenius Dify Knowledge Base Management

From Leeroopedia
Knowledge Sources
Domains: RAG, Knowledge_Management, Data_Engineering
Last Updated: 2026-02-12 07:00 GMT

Overview

End-to-end process for creating and managing knowledge bases (datasets) in Dify, from document upload through text chunking and embedding to retrieval quality testing.

Description

This workflow covers the complete lifecycle of knowledge base management in Dify's Retrieval-Augmented Generation (RAG) system. Users create datasets, upload documents from various sources (files, web pages, Notion), configure chunking strategies, select embedding models, monitor indexing progress, manage document segments, and test retrieval quality. The system supports multiple chunking modes (automatic, custom, parent-child), configurable embedding providers, and hybrid retrieval strategies combining keyword and semantic search with optional reranking.

Usage

Use this workflow when your LLM applications need access to domain-specific knowledge. It builds the knowledge layer that powers retrieval-augmented generation in chatbots, agents, and workflows, and it applies whenever you have documents, web content, or structured data that your AI applications should be able to search.

Execution Steps

Step 1: Create Dataset

Create a new dataset (knowledge base) with a name and optional description. Choose between creating an empty dataset for later population or uploading documents immediately during creation. Configure the indexing method and embedding model that will be used for all documents in this dataset.

Dataset configuration:

  • Name and description for identification
  • Indexing technique: high-quality (vector + keyword) or economical (keyword only)
  • Embedding model selection from configured providers
  • Permission settings: only-me or all-team-members
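
The configuration above can be sketched as a small validated structure. This is a minimal illustration, not Dify's actual API: the class name `DatasetConfig`, the field names, and the string values (`high_quality`, `economy`, `only_me`, `all_team_members`) are assumptions chosen to mirror the options listed in this step.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Assumed wire values; the workflow names the options but not their exact spelling.
INDEXING_TECHNIQUES = {"high_quality", "economy"}
PERMISSIONS = {"only_me", "all_team_members"}


@dataclass
class DatasetConfig:
    """Hypothetical container for the Step 1 settings."""
    name: str
    description: str = ""
    indexing_technique: str = "high_quality"
    embedding_model: Optional[str] = None
    permission: str = "only_me"

    def to_payload(self) -> dict:
        """Validate the settings and return them as a plain dict."""
        if not self.name.strip():
            raise ValueError("dataset name must not be empty")
        if self.indexing_technique not in INDEXING_TECHNIQUES:
            raise ValueError(f"unknown indexing technique: {self.indexing_technique}")
        if self.permission not in PERMISSIONS:
            raise ValueError(f"unknown permission: {self.permission}")
        payload = asdict(self)
        # Economical indexing is keyword only, so an embedding model does not apply.
        if self.indexing_technique == "economy":
            payload["embedding_model"] = None
        return payload
```

Validating before any request is sent keeps a mistyped option from reaching the indexing pipeline, where it would be harder to diagnose.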

Step 2: Upload and Add Documents

Add documents to the dataset from supported data sources. The platform accepts file uploads (PDF, TXT, Markdown, DOCX, HTML, CSV, and more), web page URLs for crawling, and Notion page imports. Each document is queued for asynchronous processing.

Data sources and upload options:

  • File upload: Direct upload of supported document formats
  • Web crawling: Import content from URLs
  • Notion sync: Import pages from connected Notion workspaces
  • Batch upload of multiple files simultaneously
  • Metadata assignment for document-level attributes

Step 3: Configure Chunking Strategy

Select and configure the text chunking strategy that determines how documents are split into retrievable segments. Different strategies suit different content types and retrieval patterns.

Chunking modes:

  • Automatic: Platform applies its default chunk size, overlap, and pre-processing rules
  • Custom: Manually specify chunk size, overlap, and separator rules
  • Parent-child (hierarchical): Create nested chunks where parent chunks provide context and child chunks provide precision
  • QA mode: Generate question-answer pairs from document content for structured retrieval
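
To make the parent-child mode concrete, the sketch below splits text into paragraph-level parents and sentence-level children, and keeps a link from each child back to its parent. The function name `parent_child_chunks` and the two separators are illustrative assumptions; the platform's own splitting rules are not specified here.

```python
def parent_child_chunks(text, parent_sep="\n\n", child_sep=". "):
    """Return child chunks, each carrying the parent chunk it came from.

    Children are what a query would be matched against (precision);
    the parent is what would be handed to the LLM (context).
    """
    parents = [p.strip() for p in text.split(parent_sep) if p.strip()]
    chunks = []
    for parent_id, parent in enumerate(parents):
        for child in parent.split(child_sep):
            child = child.strip()
            if child:
                chunks.append({"parent_id": parent_id, "child": child, "parent": parent})
    return chunks
```

For example, calling `parent_child_chunks("A one. A two.\n\nB one.")` yields three children, the first two sharing parent 0.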

Parameters:

  • Maximum chunk size (tokens)
  • Chunk overlap for context continuity
  • Separator rules for splitting boundaries
  • Pre-processing rules (whitespace, URL handling)
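
The interaction between these parameters can be sketched as follows. This is a simplified illustration under two assumptions: whitespace-separated words stand in for model tokens, and the helper names `preprocess` and `chunk_text` are hypothetical rather than part of Dify.

```python
import re


def preprocess(text, collapse_whitespace=True, remove_urls=False):
    """Apply simple pre-processing rules before splitting."""
    if remove_urls:
        text = re.sub(r"https?://\S+", "", text)
    if collapse_whitespace:
        text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs -> one space
        text = re.sub(r"\n{3,}", "\n\n", text)   # 3+ newlines -> paragraph break
    return text.strip()


def chunk_text(text, max_tokens=500, overlap=50, separator="\n\n"):
    """Split on the separator first, then window each block with overlap."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for block in text.split(separator):
        tokens = block.split()  # words approximate tokens in this sketch
        for start in range(0, len(tokens), step):
            chunks.append(" ".join(tokens[start:start + max_tokens]))
            if start + max_tokens >= len(tokens):
                break  # the block's tail is already covered
    return chunks
```

With `max_tokens=4` and `overlap=1`, consecutive chunks share one word, which is how overlap preserves context across a chunk boundary.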

Step 4: Monitor Indexing Progress

Track the asynchronous document processing pipeline as it progresses through parsing, chunking, embedding, and indexing stages. The platform processes documents via Celery task queues with priority support for urgent updates.

Processing stages:

  • Document parsing and text extraction
  • Text chunking according to configured strategy
  • Embedding generation via the selected model
  • Vector index insertion into the configured vector database
  • Keyword index building for hybrid retrieval

Monitoring:

  • Per-document indexing status (queued, processing, completed, failed, paused)
  • Word count and segment count after processing
  • Ability to pause and resume indexing
  • Re-indexing for configuration changes
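
Because processing is asynchronous, a client typically polls until no document is still in flight. The sketch below assumes a caller-supplied `fetch_statuses` function (hypothetical) that returns a mapping of document ID to one of the status names listed above; the status strings a real deployment reports may differ.

```python
import time

# Statuses that mean work is still in progress (names taken from the list above).
ACTIVE = {"queued", "processing"}


def wait_for_indexing(fetch_statuses, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll until no document is queued or processing, then return the last snapshot.

    fetch_statuses: callable returning {document_id: status}.
    Raises TimeoutError if documents are still active after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        statuses = fetch_statuses()
        if not any(status in ACTIVE for status in statuses.values()):
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError("indexing still in progress")
        sleep(interval)
```

Paused and failed documents end the wait rather than extend it, so the caller should inspect the returned mapping to decide whether to resume or re-index.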

Step 5: Manage Segments and Metadata

Review and manage individual text segments after indexing. Edit segment content, add or modify metadata, enable or disable specific segments, and manage document-level metadata fields. Segments can be searched, filtered, and individually toggled for retrieval.

Segment management:

  • Browse segments with keyword search and filtering
  • Edit segment content and metadata
  • Enable or disable individual segments
  • Add custom metadata fields to documents
  • Batch operations for bulk segment management
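
The operations above can be modeled with a small in-memory sketch. The `Segment` class and the helpers `search_segments` and `set_enabled` are hypothetical names for illustration only; they do not correspond to a documented Dify interface.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    id: str
    content: str
    enabled: bool = True
    metadata: dict = field(default_factory=dict)


def search_segments(segments, keyword="", enabled=None):
    """Filter by case-insensitive keyword and, optionally, by enabled state."""
    needle = keyword.lower()
    return [
        s for s in segments
        if needle in s.content.lower() and (enabled is None or s.enabled == enabled)
    ]


def set_enabled(segments, ids, enabled):
    """Batch-toggle segments by ID; return how many actually changed."""
    wanted = set(ids)
    changed = 0
    for s in segments:
        if s.id in wanted and s.enabled != enabled:
            s.enabled = enabled
            changed += 1
    return changed
```

Disabling a segment rather than deleting it keeps the content available for later review while excluding it from retrieval.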

Step 6: Test Retrieval Quality

Use the built-in hit testing feature to evaluate retrieval quality. Submit test queries and examine which segments are retrieved, their relevance scores, and their ranking order. Adjust retrieval parameters based on test results.

Hit testing:

  • Submit natural language queries
  • View retrieved segments with relevance scores
  • Compare results across retrieval methods (keyword, semantic, hybrid)
  • Adjust top-K and score threshold parameters
  • Validate that important content is retrievable
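
The effect of top-K and the score threshold is easiest to see on a toy example. The sketch below combines semantic and keyword scores with a weighted sum and then applies both cut-offs. The weighted sum is an assumption made for illustration, not necessarily how the platform merges results, and the function names are hypothetical.

```python
def hybrid_scores(semantic, keyword, semantic_weight=0.7):
    """Merge two {segment_id: score} maps with a weighted sum.

    A segment missing from one map contributes 0.0 for that method.
    """
    ids = set(semantic) | set(keyword)
    return {
        i: semantic_weight * semantic.get(i, 0.0)
        + (1.0 - semantic_weight) * keyword.get(i, 0.0)
        for i in ids
    }


def top_k(scores, k=3, score_threshold=0.0):
    """Drop segments below the threshold, then keep the k highest-scoring."""
    kept = [(i, s) for i, s in scores.items() if s >= score_threshold]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))  # score desc, ID as tie-break
    return kept[:k]
```

Raising the threshold trades recall for precision, while raising K does the opposite, so it helps to change one parameter at a time between test queries.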
