Principle:Langgenius Dify Chunking Strategy
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| Dify | RAG, Knowledge_Management, Frontend | 2026-02-12 00:00 GMT |
Overview
Description
Chunking Strategy governs how uploaded documents are segmented into discrete, indexable units (chunks or segments) within the Dify knowledge base. The quality of chunking directly determines retrieval precision: chunks that are too large dilute relevance signals, while chunks that are too small lose contextual coherence.
Dify exposes chunking configuration through process rules -- a combination of a chunking mode, segmentation parameters, and pre-processing rules. The platform provides both a default process rule (system-recommended settings) and the ability to fetch the process rule used by an existing document, enabling users to replicate or adjust proven configurations.
Three distinct chunking modes (referred to as ChunkingMode or doc_form) are supported:
- text_model (General Text) -- Standard text segmentation using separator-based splitting with configurable token limits and overlap.
- qa_model (Q&A) -- Specialized segmentation that structures content into question-answer pairs, optimized for FAQ-style knowledge bases.
- hierarchical_model (Parent-Child) -- Two-tier segmentation where parent chunks provide broad context and child sub-chunks provide granular retrieval targets, supporting both full-doc and paragraph parent modes.
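To make the shape of a process rule concrete, the following is a minimal Python sketch, not Dify's actual schema. The enum values and the field names (separator, max_tokens, chunk_overlap, parent_mode, subchunk_segmentation) come from this page; the class layout, the numeric defaults, and the validate helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ChunkingMode(str, Enum):
    """doc_form values listed on this page."""
    TEXT = "text_model"
    QA = "qa_model"
    PARENT_CHILD = "hierarchical_model"


class ProcessMode(str, Enum):
    CUSTOM = "custom"
    HIERARCHICAL = "hierarchical"


@dataclass
class Segmentation:
    # Placeholder defaults, not Dify's real default process rule.
    separator: str = "\n\n"
    max_tokens: int = 500
    chunk_overlap: int = 0


@dataclass
class ProcessRule:
    mode: ProcessMode
    segmentation: Segmentation
    pre_processing_rules: dict = field(default_factory=dict)
    parent_mode: Optional[str] = None  # "full-doc" or "paragraph"
    subchunk_segmentation: Optional[Segmentation] = None

    def validate(self, doc_form: ChunkingMode) -> None:
        """Hypothetical consistency check between doc_form and the rule."""
        hierarchical = doc_form is ChunkingMode.PARENT_CHILD
        if hierarchical != (self.mode is ProcessMode.HIERARCHICAL):
            raise ValueError("process mode does not match the chunking mode")
        if hierarchical:
            if self.parent_mode not in ("full-doc", "paragraph"):
                raise ValueError("parent_mode must be 'full-doc' or 'paragraph'")
            if self.subchunk_segmentation is None:
                raise ValueError("hierarchical rules need subchunk_segmentation")


# Usage: a parent-child rule with paragraph parents and line-level children.
rule = ProcessRule(
    mode=ProcessMode.HIERARCHICAL,
    segmentation=Segmentation(),
    pre_processing_rules={"remove_extra_spaces": True},
    parent_mode="paragraph",
    subchunk_segmentation=Segmentation(separator="\n", max_tokens=100),
)
rule.validate(ChunkingMode.PARENT_CHILD)
```

Keeping the sub-chunk settings in their own Segmentation value mirrors the idea that parent and child segmentation are configured independently.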
Usage
- Configuring new document uploads -- Before submitting a document, fetch the default process rule to populate the UI with recommended segmentation settings.
- Replicating existing configurations -- When adding related documents, fetch the process rule of an already-indexed document and reuse its settings for consistency.
- Fine-tuning segmentation -- Adjust the separator, max_tokens, and chunk_overlap parameters to optimize chunk boundaries for a specific content type (e.g., technical documentation vs. conversational transcripts).
- Hierarchical retrieval -- Select hierarchical_model to enable parent-child chunking, configuring both the parent segmentation and the sub-chunk segmentation independently.
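The replicate-then-adjust workflow can be sketched as a small helper that copies a fetched process rule and overrides only its segmentation parameters. This is an illustration under stated assumptions: the dictionary layout and the derive_process_rule name are hypothetical, and fetching the rule from Dify is left out entirely.

```python
import copy

# The three tunable parameters named on this page.
TUNABLE = {"separator", "max_tokens", "chunk_overlap"}


def derive_process_rule(existing_rule: dict, **overrides) -> dict:
    """Return a copy of existing_rule with selected segmentation values changed.

    existing_rule is assumed to look like
    {"mode": ..., "segmentation": {"separator": ..., "max_tokens": ..., ...}}.
    """
    unknown = set(overrides) - TUNABLE
    if unknown:
        raise ValueError(f"unsupported segmentation parameters: {sorted(unknown)}")
    rule = copy.deepcopy(existing_rule)  # never mutate the fetched rule
    rule["segmentation"].update(overrides)
    return rule


# Usage: reuse the rule of an indexed document, with smaller chunks.
fetched = {
    "mode": "custom",
    "segmentation": {"separator": "\n\n", "max_tokens": 500, "chunk_overlap": 50},
}
tuned = derive_process_rule(fetched, max_tokens=256)
```

Copying before updating keeps the original rule available for comparison, which helps when several related documents should share one baseline configuration.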
Theoretical Basis
- Fixed-Size vs. Semantic Chunking -- Dify adopts a separator-plus-token-limit approach: documents are first split at natural boundaries (e.g., double newlines), and any resulting piece is then capped at a maximum token count. This hybrid strategy preserves paragraph structure while keeping every chunk within a bounded size for the embedding model.
- Chunk Overlap -- The optional chunk_overlap parameter creates redundancy at chunk boundaries, reducing the risk of splitting critical information across two chunks. This is a well-established technique in information retrieval to improve recall at segment edges.
- Pre-Processing Rules -- Before segmentation, optional rules (e.g., remove_extra_spaces, remove_urls_emails) clean the raw text. These transformations improve embedding quality by removing noise that could distort vector representations.
- Parent-Child Indexing -- The hierarchical model implements a coarse-to-fine retrieval pattern. Parent chunks (full document or paragraph level) are stored alongside finer-grained child chunks. At query time, child chunks provide precise matches while their parent chunks supply broader context for the language model.
- Process Mode Distinction -- The ProcessMode enum distinguishes between custom (user-defined rules) and hierarchical (parent-child mode), with the latter unlocking additional configuration fields like parent_mode and subchunk_segmentation.
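The ideas above (pre-processing, separator-plus-token-limit splitting with overlap, and parent-child segmentation) can be sketched in a few lines of Python. This is not Dify's implementation: whitespace-separated words stand in for model tokens, the regular expressions are simplistic, and the function names preprocess, split_text, and split_parent_child are hypothetical.

```python
import re


def preprocess(text, remove_extra_spaces=True, remove_urls_emails=False):
    """Clean raw text before segmentation (illustrative regexes only)."""
    if remove_urls_emails:
        text = re.sub(r"\S+@\S+\.\S+", "", text)  # drop email-like tokens
        text = re.sub(r"https?://\S+", "", text)  # drop http(s) URLs
    if remove_extra_spaces:
        text = re.sub(r"\n{3,}", "\n\n", text)    # collapse runs of blank lines
        text = re.sub(r"[ \t]{2,}", " ", text)    # collapse runs of spaces/tabs
    return text


def split_text(text, separator="\n\n", max_tokens=500, chunk_overlap=0):
    """Split at the separator, then cap oversized pieces with a sliding window."""
    if not 0 <= chunk_overlap < max_tokens:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < max_tokens")
    step = max_tokens - chunk_overlap
    chunks = []
    for piece in text.split(separator):
        tokens = piece.split()  # assumption: words approximate tokens
        if not tokens:
            continue
        if len(tokens) <= max_tokens:
            chunks.append(piece.strip())  # piece fits: keep it as written
            continue
        for start in range(0, len(tokens), step):
            chunks.append(" ".join(tokens[start:start + max_tokens]))
            if start + max_tokens >= len(tokens):
                break  # the last window already reached the end
    return chunks


def split_parent_child(text, parent_mode="paragraph", parent=None, child=None):
    """Build parent chunks, then split each parent into child sub-chunks."""
    parent = parent or {"separator": "\n\n", "max_tokens": 500}
    child = child or {"separator": "\n", "max_tokens": 100}
    if parent_mode == "full-doc":
        parents = [text.strip()]  # the whole document is the single parent
    elif parent_mode == "paragraph":
        parents = split_text(text, **parent)
    else:
        raise ValueError("parent_mode must be 'full-doc' or 'paragraph'")
    return [{"parent": p, "children": split_text(p, **child)} for p in parents]


# Usage: a 10-word paragraph capped at 4 tokens with 1 token of overlap.
print(split_text("a b c d e f g h i j\n\nshort paragraph", max_tokens=4, chunk_overlap=1))
# → ['a b c d', 'd e f g', 'g h i j', 'short paragraph']
```

Each window in the capped paragraph starts chunk_overlap tokens before the previous one ended, so the boundary tokens d and g appear in two chunks. In the parent-child sketch, a match on a child can be traced back to the parent entry that holds it, which is the piece of context handed to the language model.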