Workflow:Langgenius Dify Knowledge Base Creation

Knowledge Sources	Dify, Dify Docs
Domains	RAG, Knowledge_Management, LLMs
Last Updated	2026-02-08 14:00 GMT

Overview

End-to-end process for creating a Knowledge Base (RAG pipeline) in Dify, from data source ingestion through chunking, embedding, and retrieval configuration.

Description

This workflow covers the complete Knowledge Base creation lifecycle in Dify. Users select a data source (file upload, Notion sync, or web scraping), configure a document chunking strategy (general, Q&A, or parent-child; a graph mode is reserved for future use), choose an indexing method (high-quality embedding or economy keyword-based), and set up retrieval parameters. The result is a searchable knowledge base that can be connected to chatbots, agents, and workflows for Retrieval-Augmented Generation.

Usage

Execute this workflow when you have domain-specific documents (PDFs, text files, web pages, or Notion content) that you want to make available to LLM applications as a knowledge source. High-quality indexing requires at least one configured embedding model provider; without one, you can use economy mode with keyword-based retrieval.

Execution Steps

Step 1: Data Source Selection

Choose the source of documents for the knowledge base. File upload accepts drag-and-drop of common document formats. Notion sync connects to a Notion workspace via API integration. Web scraping uses a configurable crawler provider to extract content from URLs. You can also create an empty knowledge base and add documents later.

Supported data sources:

File Upload — Drag and drop or browse for PDFs, text files, DOCX, and other supported formats
Notion Sync — Connect to a Notion workspace and select pages to import
Website Crawl — Crawl web pages using Firecrawl, Jina Reader, or Watercrawl providers
Empty Knowledge — Create the knowledge base first and add documents later

Step 2: Document Processing Configuration

Configure how ingested documents are split into retrievable chunks. Select a chunking mode that matches the content structure, set chunk size and overlap parameters, and define text preprocessing rules.

Chunking modes:

General (text_model) — Standard paragraph-based splitting for most document types
Q&A (qa_model) — Pair-based format for FAQ-style content (requires high-quality indexing)
Parent-Child (hierarchical_model) — Multi-level hierarchical chunks with parent context preservation
Graph (graph_model) — Reserved for future Graph RAG capabilities
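
The effect of the chunk size and overlap parameters can be illustrated with a minimal sketch. It assumes sizes measured in characters and a blank-line paragraph separator; `split_general` and its parameters are hypothetical names chosen for illustration, not Dify's implementation.

```python
def split_general(text: str, chunk_size: int = 200, overlap: int = 20,
                  separator: str = "\n\n") -> list[str]:
    """Split on a separator, then window oversized pieces with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks: list[str] = []
    step = chunk_size - overlap
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        if len(piece) <= chunk_size:
            # Short paragraphs become a single chunk.
            chunks.append(piece)
            continue
        # Slide a fixed-size window; consecutive windows share `overlap` characters.
        for start in range(0, len(piece), step):
            chunks.append(piece[start:start + chunk_size])
            if start + chunk_size >= len(piece):
                break
    return chunks
```

A larger overlap repeats more context across neighbouring chunks, which can help retrieval when an answer straddles a chunk boundary, at the cost of storing and embedding more text.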

Preprocessing options:

Replace consecutive whitespace characters
Remove stopwords
Delete URLs and email addresses
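
As a rough illustration of what these preprocessing rules do to the text before chunking, the sketch below applies them with regular expressions. The function name, flag names, regex patterns, and the tiny stopword list are all assumptions made for this example; Dify's own rules may differ in detail.

```python
import re

# Simplified patterns, assumed for illustration only.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
STOPWORDS = frozenset({"a", "an", "the", "of", "and", "to", "in"})


def preprocess(text: str, *, collapse_whitespace: bool = True,
               remove_stopwords: bool = False,
               strip_urls_emails: bool = False) -> str:
    """Apply the selected cleaning rules in a fixed order."""
    if strip_urls_emails:
        text = URL_RE.sub("", text)
        text = EMAIL_RE.sub("", text)
    if remove_stopwords:
        # Rejoining on single spaces also normalizes whitespace as a side effect.
        text = " ".join(w for w in text.split() if w.lower() not in STOPWORDS)
    if collapse_whitespace:
        # Runs of spaces, tabs, and newlines become one space.
        text = re.sub(r"\s+", " ", text).strip()
    return text
```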

Step 3: Indexing Method Selection

Choose the indexing strategy that balances accuracy, cost, and resource requirements. High-quality mode uses embedding models to create vector representations of each chunk. Economy mode extracts keywords without consuming embedding tokens.

Indexing methods:

High Quality — Vector embeddings via configured model provider (higher accuracy, consumes tokens)
Economy — Keyword extraction with 10 keywords per chunk (no token cost, lower accuracy)
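
To make the economy mode concrete, here is a minimal sketch of keyword indexing with a cap of 10 keywords per chunk. It uses plain term frequency as a stand-in ranking; the actual keyword extractor Dify uses is not described here, and `extract_keywords`, `build_keyword_index`, and the stopword list are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not Dify's).
STOPWORDS = frozenset({"the", "and", "for", "from", "with", "that", "this"})


def extract_keywords(chunk: str, max_keywords: int = 10) -> list[str]:
    """Return up to `max_keywords` terms ranked by frequency."""
    words = re.findall(r"[a-z0-9]+", chunk.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    # Descending frequency, then alphabetical order for deterministic ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:max_keywords]]


def build_keyword_index(chunks: list[str]) -> dict[str, set[int]]:
    """Map each keyword to the positions of the chunks that contain it."""
    index: dict[str, set[int]] = {}
    for position, chunk in enumerate(chunks):
        for keyword in extract_keywords(chunk):
            index.setdefault(keyword, set()).add(position)
    return index
```

Because no embedding model is called, building such an index consumes no tokens, but a query only matches chunks that share its literal keywords.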

Step 4: Embedding and Indexing Execution

The system processes all documents through the configured pipeline. Documents transition through states: queuing, indexing, and completed (or error). Users can pause, resume, or cancel processing. Token and chunk count estimates are provided before execution begins.

What happens:

Documents are queued for processing
Each document is split into chunks per the configured strategy
Chunks are embedded (high-quality) or keyword-indexed (economy)
Processing status is tracked per document with progress indicators
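
The per-document status tracking described above can be modelled as a small state machine. This is a sketch under stated assumptions: the `queuing`, `indexing`, `completed`, and `error` states come from the text, while the `paused` and `cancelled` state names and the exact set of allowed transitions are assumptions for illustration.

```python
from enum import Enum


class DocStatus(Enum):
    QUEUING = "queuing"
    INDEXING = "indexing"
    PAUSED = "paused"        # assumed name for the paused state
    COMPLETED = "completed"
    ERROR = "error"
    CANCELLED = "cancelled"  # assumed name for the cancelled state


# Allowed transitions (assumed); terminal states have no outgoing edges.
ALLOWED: dict[DocStatus, set[DocStatus]] = {
    DocStatus.QUEUING: {DocStatus.INDEXING, DocStatus.CANCELLED},
    DocStatus.INDEXING: {DocStatus.PAUSED, DocStatus.COMPLETED,
                         DocStatus.ERROR, DocStatus.CANCELLED},
    DocStatus.PAUSED: {DocStatus.INDEXING, DocStatus.CANCELLED},
    DocStatus.COMPLETED: set(),
    DocStatus.ERROR: set(),
    DocStatus.CANCELLED: set(),
}


def transition(current: DocStatus, target: DocStatus) -> DocStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```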

Step 5: Retrieval Configuration

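Configure how the knowledge base responds to queries. Select the retrieval method, set the number of results to return (top_k), define a similarity score threshold, and optionally enable a reranking model for improved result quality. Semantic, full-text, and hybrid search apply to High Quality indexing; keyword search is the method used with Economy indexing.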

Retrieval methods:

Semantic Search — Vector similarity matching
Full-Text Search — Text-based matching
Hybrid Search — Combines semantic and full-text (recommended)
Keyword Search — Inverted index matching

Tunable parameters:

Top K (number of chunks to retrieve)
Score threshold (minimum similarity cutoff)
Rerank model selection and configuration
Weighted score balance between semantic and keyword results
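
How these parameters interact can be sketched as follows: a weighted blend of semantic and keyword scores, a threshold cutoff, then a top-k cut. The `Hit` type, the `hybrid_retrieve` function, the linear weighting formula, and the assumption that both scores are already normalized to [0, 1] are illustrative choices, not a description of Dify's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    semantic: float  # vector similarity, assumed normalized to [0, 1]
    keyword: float   # keyword/full-text score, assumed normalized to [0, 1]


def hybrid_retrieve(hits: list[Hit], top_k: int = 3,
                    score_threshold: float = 0.5,
                    semantic_weight: float = 0.7) -> list[tuple[str, float]]:
    """Blend scores, drop hits under the threshold, return the best top_k."""
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    keyword_weight = 1.0 - semantic_weight
    scored = [
        (semantic_weight * h.semantic + keyword_weight * h.keyword, h.chunk_id)
        for h in hits
    ]
    kept = [(score, cid) for score, cid in scored if score >= score_threshold]
    # Highest score first; chunk id breaks ties deterministically.
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(cid, round(score, 4)) for score, cid in kept[:top_k]]
```

Raising the score threshold trades recall for precision, while Top K bounds how much retrieved text is passed to the LLM regardless of how many chunks clear the threshold.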

Step 6: Knowledge Base Verification

Test the knowledge base with sample queries using the built-in hit testing interface. Review retrieved chunks, their relevance scores, and source attribution. Iterate on chunking and retrieval settings if results are unsatisfactory.

Verification capabilities:

Query input with real-time retrieval results
Chunk content preview with relevance scores
Retrieval history for comparing configuration changes
Source document attribution
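
When iterating on settings, it can help to turn the sample queries into a repeatable check. The sketch below computes a simple hit rate: the fraction of queries whose expected source document appears among the top results. The `hit_rate` function and the `retrieve` callable are hypothetical; `retrieve` stands in for whatever mechanism returns `(source, score)` pairs for a query and is not a Dify API.

```python
from typing import Callable

Retriever = Callable[[str], list[tuple[str, float]]]


def hit_rate(cases: list[tuple[str, str]], retrieve: Retriever,
             top_k: int = 3) -> float:
    """Fraction of (query, expected_source) cases found in the top_k results."""
    if not cases:
        return 0.0
    found = 0
    for query, expected_source in cases:
        sources = [source for source, _score in retrieve(query)[:top_k]]
        if expected_source in sources:
            found += 1
    return found / len(cases)
```

Running the same cases before and after a change to chunking or retrieval settings gives a rough, comparable signal of whether the change helped.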
