Workflow:Langgenius Dify Knowledge Base Creation

From Leeroopedia


Knowledge Sources
Domains: RAG, Knowledge_Management, LLMs
Last Updated: 2026-02-08 14:00 GMT

Overview

End-to-end process for creating a Knowledge Base (RAG pipeline) in Dify, from data source ingestion through chunking, embedding, and retrieval configuration.

Description

This workflow covers the complete Knowledge Base creation lifecycle in Dify. Users select a data source (file upload, Notion sync, or web scraping), configure document chunking strategies (general, Q&A, parent-child, or graph modes), choose an indexing method (high-quality embedding or economy keyword-based), and set up retrieval parameters. The result is a searchable knowledge base that can be connected to chatbots, agents, and workflows for Retrieval-Augmented Generation.

Usage

Execute this workflow when you have domain-specific documents (PDFs, text files, web pages, or Notion content) that you want to make available to LLM applications as a knowledge source. High-quality indexing requires at least one configured embedding model provider; without one, use economy mode, which relies on keyword-based retrieval.

Execution Steps

Step 1: Data Source Selection

Choose the source of documents for the knowledge base. Dify supports several ingestion methods, one per content origin: file upload accepts common document formats by drag-and-drop, Notion sync connects to a Notion workspace through an API integration, and web scraping uses a configurable crawler provider to extract content from URLs. A knowledge base can also be created empty and populated later.

Supported data sources:

  • File Upload — Drag and drop or browse for PDFs, text files, DOCX, and other supported formats
  • Notion Sync — Connect to a Notion workspace and select pages to import
  • Website Crawl — Crawl web pages using Firecrawl, Jina Reader, or Watercrawl providers
  • Empty Knowledge — Create the knowledge base first and add documents later

Step 2: Document Processing Configuration

Configure how ingested documents are split into retrievable chunks. Select a chunking mode that matches the content structure, set chunk size and overlap parameters, and define text preprocessing rules.

Chunking modes:

  • General (text_model) — Standard paragraph-based splitting for most document types
  • Q&A (qa_model) — Pair-based format for FAQ-style content (requires high-quality indexing)
  • Parent-Child (hierarchical_model) — Multi-level hierarchical chunks with parent context preservation
  • Graph (graph_model) — Reserved for future Graph RAG capabilities

Preprocessing options:

  • Replace consecutive whitespace characters
  • Remove stopwords
  • Delete URLs and email addresses
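To make the chunk size and overlap parameters concrete, here is a minimal Python sketch. It is not Dify's splitter: it assumes fixed-size character windows rather than paragraph-aware splitting, and the names `preprocess`, `chunk_text`, `chunk_size`, and `overlap` are hypothetical. It also illustrates two of the preprocessing rules listed above (whitespace collapsing and URL/email removal).

```python
import re

# Rough patterns for illustration only; real URL/email matching is more involved.
URL_OR_EMAIL = re.compile(r"https?://\S+|\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def preprocess(text: str, strip_urls_and_emails: bool = True) -> str:
    """Apply two of the preprocessing rules: delete URLs/emails, collapse whitespace."""
    if strip_urls_and_emails:
        text = URL_OR_EMAIL.sub("", text)
    # Replace any run of whitespace (spaces, tabs, newlines) with a single space.
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Under these assumptions, a larger overlap preserves more context across chunk boundaries at the cost of producing more chunks to index.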

Step 3: Indexing Method Selection

Choose the indexing strategy that balances accuracy, cost, and resource requirements. High-quality mode uses embedding models to create vector representations of each chunk. Economy mode extracts keywords without consuming embedding tokens.

Indexing methods:

  • High Quality — Vector embeddings via configured model provider (higher accuracy, consumes tokens)
  • Economy — Keyword extraction with 10 keywords per chunk (no token cost, lower accuracy)
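As an illustration of what keyword-based indexing involves, the sketch below keeps up to 10 keywords per chunk and builds an inverted index from keyword to chunk IDs. It is an assumption-laden stand-in, not Dify's extractor: plain term frequency is used to pick keywords, and the stopword list and the names `extract_keywords` and `build_keyword_index` are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (assumption, not Dify's).
STOPWORDS = frozenset({"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"})


def extract_keywords(chunk: str, max_keywords: int = 10) -> list[str]:
    """Return up to max_keywords terms, most frequent first."""
    words = re.findall(r"[a-z0-9]+", chunk.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 1)
    return [word for word, _count in counts.most_common(max_keywords)]


def build_keyword_index(chunks: list[str]) -> dict[str, set[int]]:
    """Map each keyword to the set of chunk positions that contain it."""
    index: defaultdict[str, set[int]] = defaultdict(set)
    for chunk_id, chunk in enumerate(chunks):
        for keyword in extract_keywords(chunk):
            index[keyword].add(chunk_id)
    return dict(index)
```

No embedding model is called anywhere in this path, which is why this style of indexing consumes no tokens; the trade-off is that matching depends on literal term overlap between query and chunk.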

Step 4: Embedding and Indexing Execution

The system processes all documents through the configured pipeline. Documents transition through states: queuing, indexing, and completed (or error). Users can pause, resume, or cancel processing. Token and chunk count estimates are provided before execution begins.

What happens:

  • Documents are queued for processing
  • Each document is split into chunks per the configured strategy
  • Chunks are embedded (high-quality) or keyword-indexed (economy)
  • Processing status is tracked per document with progress indicators
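The per-document lifecycle described above can be modeled as a small state machine. The sketch below is illustrative only: the `paused` state and the exact transition table are assumptions inferred from the pause/resume behavior, cancellation is not modeled, and the names `Status`, `ALLOWED`, `advance`, and `is_terminal` are hypothetical rather than Dify identifiers.

```python
from enum import Enum


class Status(Enum):
    QUEUING = "queuing"
    INDEXING = "indexing"
    PAUSED = "paused"        # assumed intermediate state for pause/resume
    COMPLETED = "completed"
    ERROR = "error"


# Assumed transition table: which states may follow each state.
ALLOWED = {
    Status.QUEUING: {Status.INDEXING, Status.ERROR},
    Status.INDEXING: {Status.PAUSED, Status.COMPLETED, Status.ERROR},
    Status.PAUSED: {Status.INDEXING},
    Status.COMPLETED: set(),
    Status.ERROR: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def is_terminal(status: Status) -> bool:
    """A state is terminal when no further transition is allowed from it."""
    return not ALLOWED[status]
```

A progress indicator could then poll each document until `is_terminal` returns true.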

Step 5: Retrieval Configuration

Configure how the knowledge base responds to queries. Select the retrieval method, set the number of results to return (top_k), define a similarity score threshold, and optionally enable a reranking model for improved result quality.

Retrieval methods:

  • Semantic Search — Vector similarity matching
  • Full-Text Search — Text-based matching
  • Hybrid Search — Combines semantic and full-text (recommended)
  • Keyword Search — Inverted index matching

Tunable parameters:

  • Top K (number of chunks to retrieve)
  • Score threshold (minimum similarity cutoff)
  • Rerank model selection and configuration
  • Weighted score balance between semantic and keyword results
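The interaction between the weighted score balance, the score threshold, and Top K can be sketched as follows. This is a simplified model, not Dify's ranking code: it assumes both scores are already normalized to the range 0 to 1, combines them linearly, and uses the hypothetical names `Hit`, `hybrid_rank`, and `semantic_weight`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    semantic: float  # vector similarity, assumed normalized to [0, 1]
    keyword: float   # full-text/keyword score, assumed normalized to [0, 1]


def hybrid_rank(
    hits: list[Hit],
    semantic_weight: float = 0.7,
    top_k: int = 3,
    score_threshold: float = 0.0,
) -> list[tuple[str, float]]:
    """Blend the two scores, drop results below the threshold, keep the best top_k."""
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    keyword_weight = 1.0 - semantic_weight
    scored = [
        (h.chunk_id, semantic_weight * h.semantic + keyword_weight * h.keyword)
        for h in hits
    ]
    kept = [pair for pair in scored if pair[1] >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]
```

In this model, raising `score_threshold` trades recall for precision, while `top_k` caps how many chunks are handed to the LLM regardless of how many pass the threshold.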

Step 6: Knowledge Base Verification

Test the knowledge base with sample queries using the built-in hit testing interface. Review retrieved chunks, their relevance scores, and source attribution. Iterate on chunking and retrieval settings if results are unsatisfactory.

Verification capabilities:

  • Query input with real-time retrieval results
  • Chunk content preview with relevance scores
  • Retrieval history for comparing configuration changes
  • Source document attribution
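When iterating on settings, it can help to turn spot checks into a repeatable measurement. The sketch below computes a simple hit rate: the fraction of sample queries for which the expected source document appears among the retrieved results. It does not call Dify; the `retrieve` callable is a hypothetical stand-in for whatever returns `(source, score)` pairs for a query, and `hit_rate` is an illustrative name.

```python
from typing import Callable, Sequence

# Hypothetical retriever signature: query -> sequence of (source_document, score).
Retriever = Callable[[str], Sequence[tuple[str, float]]]


def hit_rate(cases: dict[str, str], retrieve: Retriever) -> float:
    """Fraction of queries whose expected source shows up in the retrieved results."""
    if not cases:
        return 0.0
    hits = 0
    for query, expected_source in cases.items():
        sources = {source for source, _score in retrieve(query)}
        if expected_source in sources:
            hits += 1
    return hits / len(cases)
```

Running the same set of cases before and after a change to chunking or retrieval settings gives a rough signal of whether the change helped.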
