Workflow: Langgenius Dify RAG Pipeline Development
| Knowledge Sources | |
|---|---|
| Domains | RAG, Data_Engineering, Pipeline_Orchestration |
| Last Updated | 2026-02-12 07:00 GMT |
Overview
End-to-end process for building, customizing, and executing RAG (Retrieval-Augmented Generation) data processing pipelines in Dify, from template selection through node configuration to pipeline publishing.
Description
This workflow covers the creation and management of RAG pipelines, which are specialized data processing workflows for knowledge base indexing. RAG pipelines define how documents flow from datasource plugins through processing nodes (chunking, transformation, embedding) into vector databases. Users select from built-in or customized pipeline templates, configure processing parameters, test pipeline execution on sample documents, and save customized pipelines as reusable templates. Pipelines can be exported and imported as DSL (YAML) files for portability.
Usage
Execute this workflow when you need to customize the data processing pipeline for knowledge base indexing beyond the default chunking and embedding configuration. This is appropriate when you have specialized document formats, need custom preprocessing steps, want to combine multiple datasource plugins, or need fine-grained control over the chunking and embedding process. RAG pipelines complement the simpler knowledge base management workflow by providing workflow-level control over the indexing process.
Execution Steps
Step 1: Select Pipeline Template
Browse available pipeline templates to find a starting point for your data processing pipeline. Templates provide pre-configured node graphs for common indexing scenarios. Choose between built-in templates (maintained by the platform) and customized templates (previously saved by users in your workspace).
Template types:
- Built-in templates: Pre-configured pipelines for common document types and indexing strategies
- Customized templates: User-created pipelines saved as reusable templates
- Import from DSL: Load a pipeline definition from a YAML file
Step 2: Configure Datasource Nodes
Set up the datasource nodes that define where documents come from. Configure datasource plugin connections, authentication credentials, and source-specific parameters. Multiple datasource nodes can feed into a single pipeline to aggregate content from different sources.
Datasource configuration:
- Select from available datasource plugins
- Configure authentication for the data source
- Set source-specific parameters (file paths, URLs, API endpoints)
- Define input fields for parameterized pipeline execution
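The datasource configuration above can be sketched as a small data structure. This is a minimal illustration, not Dify's actual node schema: the `DatasourceNode` class and the plugin names (`file_upload`, `web_crawler`) are hypothetical placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class DatasourceNode:
    """Hypothetical sketch of one datasource node's configuration."""
    plugin: str                                        # which datasource plugin to use
    credentials: dict = field(default_factory=dict)    # authentication for the source
    params: dict = field(default_factory=dict)         # source-specific parameters
    input_fields: list = field(default_factory=list)   # inputs for parameterized runs

# Two datasource nodes feeding the same pipeline, aggregating different sources
sources = [
    DatasourceNode(plugin="file_upload",
                   params={"path": "/docs/manuals"}),
    DatasourceNode(plugin="web_crawler",
                   credentials={"api_key": "<token>"},
                   params={"url": "https://example.com/kb"},
                   input_fields=["start_url"]),
]
```

Keeping credentials separate from source parameters mirrors the step above: authentication is configured once per connection, while parameters and input fields vary per run.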
Step 3: Configure Processing Nodes
Set up the processing nodes that transform raw documents into indexed chunks. Configure chunking parameters, text preprocessing rules, and any custom transformation logic. The pipeline editor provides a visual interface for connecting processing stages.
Processing capabilities:
- Text extraction from various document formats
- Chunking with configurable size, overlap, and separator rules
- Parent-child (hierarchical) chunking for nested retrieval
- QA pair generation for structured retrieval
- Custom text transformation and cleaning
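The chunking parameters listed above (size, overlap, separator) can be illustrated with a simple standalone chunker. This is a sketch of the general technique, not Dify's internal implementation; the function name and defaults are assumptions.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50,
               separator: str = "\n\n") -> list[str]:
    """Split text on a separator, then window it into overlapping chunks.

    `size` is the chunk length in characters, `overlap` is how many
    characters consecutive chunks share, and `separator` marks natural
    boundaries (paragraphs by default) used for preprocessing.
    """
    # Preprocessing: split on the separator and drop empty segments
    paragraphs = [p.strip() for p in text.split(separator) if p.strip()]
    joined = " ".join(paragraphs)

    # Sliding window: each chunk starts `size - overlap` after the last
    chunks, start = [], 0
    while start < len(joined):
        chunks.append(joined[start:start + size])
        start += size - overlap
    return chunks
```

With the defaults, a 1200-character document yields three chunks, each sharing 50 characters with its neighbor. Parent-child chunking extends the same idea by running the chunker twice: once with a large `size` for parent chunks, then again on each parent for the children.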
Step 4: Configure Embedding and Indexing
Set up the embedding model and vector database target for the processed chunks. Choose the embedding provider and model, configure batch processing parameters, and specify the target index configuration.
Embedding configuration:
- Select embedding model provider and model
- Configure batch size and processing parameters
- Specify the target vector database configuration
- Configure keyword index settings for hybrid retrieval
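The batch-processing parameter from the list above can be sketched as follows. The `embed` callable stands in for whichever embedding provider the pipeline targets; the stub embedder and function name here are illustrative assumptions, not a real provider API.

```python
from typing import Callable

def embed_in_batches(chunks: list[str],
                     embed: Callable[[list[str]], list[list[float]]],
                     batch_size: int = 16) -> list[list[float]]:
    """Send chunks to an embedding model in fixed-size batches.

    Batching keeps each request within provider limits and lets the
    pipeline tune throughput via `batch_size`.
    """
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed(chunks[i:i + batch_size]))
    return vectors

# Stub embedder standing in for a real provider call: one-dimensional
# "vector" whose value is just the chunk length
fake_embed = lambda batch: [[float(len(t))] for t in batch]
vecs = embed_in_batches(["abc", "de", "f"], fake_embed, batch_size=2)
```

The resulting vectors would then be written to the configured vector database, alongside any keyword index built for hybrid retrieval.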
Step 5: Test Pipeline Execution
Run the pipeline on sample documents to validate the processing flow. The test-run interface provides step-by-step navigation through the pipeline stages, showing intermediate results at each processing node. Filter test data by datasource and review chunk output quality.
Testing features:
- Step-by-step pipeline execution with intermediate results
- Datasource filtering for targeted testing
- Chunk preview across all chunking modes (text, parent-child, QA)
- Data clearing for re-testing with different configurations
- Processing parameter adjustment based on results
Step 6: Publish and Save as Template
Once the pipeline produces satisfactory results, publish it for production use. Optionally save the customized pipeline as a reusable template for future datasets. Published pipelines execute automatically when new documents are added to associated knowledge bases.
Publishing options:
- Publish for production execution on new documents
- Save as customized template for reuse
- Export as DSL (YAML) for sharing or version control
- Associate with knowledge bases for automatic execution
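The export/import round trip above can be illustrated with a serialized pipeline definition. Dify's real DSL is YAML with its own schema; this sketch uses JSON purely as a stand-in serializer, and the node fields and model name are hypothetical.

```python
import json

# Hypothetical pipeline definition: datasource -> chunking -> embedding.
# Field names are illustrative, not the actual Dify DSL schema.
pipeline = {
    "name": "manuals-indexing",
    "nodes": [
        {"id": "src", "type": "datasource", "plugin": "file_upload"},
        {"id": "chunk", "type": "chunking", "size": 500, "overlap": 50},
        {"id": "embed", "type": "embedding", "model": "text-embedding-3-small"},
    ],
    "edges": [["src", "chunk"], ["chunk", "embed"]],
}

# Round-trip to show why a text-based DSL gives portability: the exported
# file can be version-controlled, shared, and re-imported losslessly
exported = json.dumps(pipeline, indent=2)
restored = json.loads(exported)
```

Because the whole node graph lives in one serializable document, the same file serves all three purposes listed above: sharing, version control, and re-import as a customized template.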