Workflow: Langgenius Dify RAG Pipeline Development
| Knowledge Sources | |
|---|---|
| Domains | RAG, Data_Engineering, Pipeline_Orchestration |
| Last Updated | 2026-02-12 07:00 GMT |
Overview
End-to-end process for building, customizing, and executing RAG (Retrieval-Augmented Generation) data processing pipelines in Dify, from template selection through node configuration to pipeline publishing.
Description
This workflow covers the creation and management of RAG pipelines, which are specialized data processing workflows for knowledge base indexing. RAG pipelines define how documents flow from datasource plugins through processing nodes (chunking, transformation, embedding) into vector databases. Users select from built-in or customized pipeline templates, configure processing parameters, test pipeline execution on sample documents, and save customized pipelines as reusable templates. Pipelines can be exported and imported as DSL (YAML) files for portability.
Usage
Execute this workflow when you need to customize the data processing pipeline for knowledge base indexing beyond the default chunking and embedding configuration. This is appropriate when you have specialized document formats, need custom preprocessing steps, want to combine multiple datasource plugins, or need fine-grained control over the chunking and embedding process. RAG pipelines complement the simpler knowledge base management workflow by providing workflow-level control over the indexing process.
Execution Steps
Step 1: Select Pipeline Template
Browse available pipeline templates to find a starting point for your data processing pipeline. Templates provide pre-configured node graphs for common indexing scenarios. Choose between built-in templates (maintained by the platform) and customized templates (previously saved by users in your workspace).
Template types:
- Built-in templates: Pre-configured pipelines for common document types and indexing strategies
- Customized templates: User-created pipelines saved as reusable templates
- Import from DSL: Load a pipeline definition from a YAML file
Step 2: Configure Datasource Nodes
Set up the datasource nodes that define where documents come from. Configure datasource plugin connections, authentication credentials, and source-specific parameters. Multiple datasource nodes can feed into a single pipeline to aggregate content from different sources.
Datasource configuration:
- Select from available datasource plugins
- Configure authentication for the data source
- Set source-specific parameters (file paths, URLs, API endpoints)
- Define input fields for parameterized pipeline execution
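The datasource configuration above can be sketched as a small data structure. This is a minimal illustration, not Dify's actual node schema: the `DatasourceNode` class and the plugin names (`file_upload`, `web_crawler`) are hypothetical placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class DatasourceNode:
    """Hypothetical sketch of one datasource node's configuration."""
    plugin: str                                        # which datasource plugin to use
    credentials: dict = field(default_factory=dict)    # authentication for the source
    params: dict = field(default_factory=dict)         # source-specific parameters
    input_fields: list = field(default_factory=list)   # inputs for parameterized runs

# Two datasource nodes feeding the same pipeline, aggregating different sources
sources = [
    DatasourceNode(plugin="file_upload",
                   params={"path": "/docs/manuals"}),
    DatasourceNode(plugin="web_crawler",
                   credentials={"api_key": "<token>"},
                   params={"url": "https://example.com/kb"},
                   input_fields=["start_url"]),
]
```

Keeping credentials separate from source parameters mirrors the step above: authentication is configured once per connection, while parameters and input fields vary per run.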
Step 3: Configure Processing Nodes
Set up the processing nodes that transform raw documents into indexed chunks. Configure chunking parameters, text preprocessing rules, and any custom transformation logic. The pipeline editor provides a visual interface for connecting processing stages.
Processing capabilities:
- Text extraction from various document formats
- Chunking with configurable size, overlap, and separator rules
- Parent-child (hierarchical) chunking for nested retrieval
- QA pair generation for structured retrieval
- Custom text transformation and cleaning
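The chunking parameters listed above (size, overlap, separator) can be illustrated with a simple standalone chunker. This is a sketch of the general technique, not Dify's internal implementation; the function name and defaults are assumptions.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50,
               separator: str = "\n\n") -> list[str]:
    """Split text on a separator, then window it into overlapping chunks.

    `size` is the chunk length in characters, `overlap` is how many
    characters consecutive chunks share, and `separator` marks natural
    boundaries (paragraphs by default) used for preprocessing.
    """
    # Preprocessing: split on the separator and drop empty segments
    paragraphs = [p.strip() for p in text.split(separator) if p.strip()]
    joined = " ".join(paragraphs)

    # Sliding window: each chunk starts `size - overlap` after the last
    chunks, start = [], 0
    while start < len(joined):
        chunks.append(joined[start:start + size])
        start += size - overlap
    return chunks
```

With the defaults, a 1200-character document yields three chunks, each sharing 50 characters with its neighbor. Parent-child chunking extends the same idea by running the chunker twice: once with a large `size` for parent chunks, then again on each parent for the children.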
Step 4: Configure Embedding and Indexing
Set up the embedding model and vector database target for the processed chunks. Choose the embedding provider and model, configure batch processing parameters, and specify the target index configuration.
Embedding configuration:
- Select embedding model provider and model
- Configure batch size and processing parameters
- Specify the target vector database configuration
- Configure keyword index settings for hybrid retrieval
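The batch-processing parameter from the list above can be sketched as follows. The `embed` callable stands in for whichever embedding provider the pipeline targets; the stub embedder and function name here are illustrative assumptions, not a real provider API.

```python
from typing import Callable

def embed_in_batches(chunks: list[str],
                     embed: Callable[[list[str]], list[list[float]]],
                     batch_size: int = 16) -> list[list[float]]:
    """Send chunks to an embedding model in fixed-size batches.

    Batching keeps each request within provider limits and lets the
    pipeline tune throughput via `batch_size`.
    """
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed(chunks[i:i + batch_size]))
    return vectors

# Stub embedder standing in for a real provider call: one-dimensional
# "vector" whose value is just the chunk length
fake_embed = lambda batch: [[float(len(t))] for t in batch]
vecs = embed_in_batches(["abc", "de", "f"], fake_embed, batch_size=2)
```

The resulting vectors would then be written to the configured vector database, alongside any keyword index built for hybrid retrieval.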
Step 5: Test Pipeline Execution
Run the pipeline on sample documents to validate the processing flow. The test-run interface provides step-by-step navigation through the pipeline stages, showing intermediate results at each processing node. Filter test data by datasource and review chunk output quality.
Testing features:
- Step-by-step pipeline execution with intermediate results
- Datasource filtering for targeted testing
- Chunk preview across all chunking modes (text, parent-child, QA)
- Data clearing for re-testing with different configurations
- Processing parameter adjustment based on results
Step 6: Publish and Save as Template
Once the pipeline produces satisfactory results, publish it for production use. Optionally save the customized pipeline as a reusable template for future datasets. Published pipelines execute automatically when new documents are added to associated knowledge bases.
Publishing options:
- Publish for production execution on new documents
- Save as customized template for reuse
- Export as DSL (YAML) for sharing or version control
- Associate with knowledge bases for automatic execution
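The export/import round trip above can be illustrated with a serialized pipeline definition. Dify's real DSL is YAML with its own schema; this sketch uses JSON purely as a stand-in serializer, and the node fields and model name are hypothetical.

```python
import json

# Hypothetical pipeline definition: datasource -> chunking -> embedding.
# Field names are illustrative, not the actual Dify DSL schema.
pipeline = {
    "name": "manuals-indexing",
    "nodes": [
        {"id": "src", "type": "datasource", "plugin": "file_upload"},
        {"id": "chunk", "type": "chunking", "size": 500, "overlap": 50},
        {"id": "embed", "type": "embedding", "model": "text-embedding-3-small"},
    ],
    "edges": [["src", "chunk"], ["chunk", "embed"]],
}

# Round-trip to show why a text-based DSL gives portability: the exported
# file can be version-controlled, shared, and re-imported losslessly
exported = json.dumps(pipeline, indent=2)
restored = json.loads(exported)
```

Because the whole node graph lives in one serializable document, the same file serves all three purposes listed above: sharing, version control, and re-import as a customized template.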