Principle:Langgenius Dify Chunking Strategy

From Leeroopedia
Knowledge Sources: Dify
Domains: RAG, Knowledge_Management, Frontend
Last Updated: 2026-02-12 00:00 GMT

Overview

Description

Chunking Strategy governs how uploaded documents are segmented into discrete, indexable units (chunks or segments) within the Dify knowledge base. The quality of chunking directly determines retrieval precision: chunks that are too large dilute relevance signals, while chunks that are too small lose contextual coherence.

Dify exposes chunking configuration through process rules -- a combination of a chunking mode, segmentation parameters, and pre-processing rules. The platform provides both a default process rule (system-recommended settings) and the ability to fetch the process rule used by an existing document, enabling users to replicate or adjust proven configurations.

Three distinct chunking modes (referred to as ChunkingMode or doc_form) are supported:

  • text_model (General Text) -- Standard text segmentation using separator-based splitting with configurable token limits and overlap.
  • qa_model (Q&A) -- Specialized segmentation that structures content into question-answer pairs, optimized for FAQ-style knowledge bases.
  • hierarchical_model (Parent-Child) -- Two-tier segmentation where parent chunks provide broad context and child sub-chunks provide granular retrieval targets, supporting both full-doc and paragraph parent modes.
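
The three modes above can be modeled as a small enumeration. This is a minimal sketch, not Dify's actual definition: the string values and labels come from the list above, while the member names and the parse_doc_form helper are hypothetical.

```python
from enum import Enum


class ChunkingMode(str, Enum):
    """The three doc_form values named above (member names are illustrative)."""
    TEXT = "text_model"
    QA = "qa_model"
    PARENT_CHILD = "hierarchical_model"


# Display labels taken from the list above.
LABELS = {
    ChunkingMode.TEXT: "General Text",
    ChunkingMode.QA: "Q&A",
    ChunkingMode.PARENT_CHILD: "Parent-Child",
}


def parse_doc_form(value: str) -> ChunkingMode:
    """Convert a raw doc_form string to an enum member, rejecting unknown values."""
    try:
        return ChunkingMode(value)
    except ValueError:
        raise ValueError(f"unsupported doc_form: {value!r}") from None
```

Parsing the raw string once at the boundary means that a misspelled doc_form fails immediately instead of silently falling through to a default mode.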

Usage

  • Configuring new document uploads -- Before submitting a document, fetch the default process rule to populate the UI with recommended segmentation settings.
  • Replicating existing configurations -- When adding related documents, fetch the process rule of an already-indexed document and reuse its settings for consistency.
  • Fine-tuning segmentation -- Adjust separator, max_tokens, and chunk_overlap parameters to optimize chunk boundaries for a specific content type (e.g., technical documentation vs. conversational transcripts).
  • Hierarchical retrieval -- Select hierarchical_model to enable parent-child chunking, configuring both the parent segmentation and the sub-chunk segmentation independently.
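
The replicate-then-adjust workflow above might look like the following sketch. The parameter names (separator, max_tokens, chunk_overlap, parent_mode, subchunk_segmentation) and the rule identifiers come from this page. The nesting of the dictionary, the numeric values, and both helper functions are assumptions made for illustration, and no real API call is shown.

```python
import copy

# Hypothetical process rule, standing in for one fetched from the platform.
# The layout and the numbers are illustrative, not Dify's actual defaults.
FETCHED_RULE = {
    "mode": "custom",
    "rules": {
        "pre_processing_rules": [
            {"id": "remove_extra_spaces", "enabled": True},
            {"id": "remove_urls_emails", "enabled": False},
        ],
        "segmentation": {"separator": "\n\n", "max_tokens": 500, "chunk_overlap": 50},
    },
}

SEGMENTATION_KEYS = {"separator", "max_tokens", "chunk_overlap"}


def with_segmentation(rule, **overrides):
    """Return a copy of the rule with selected segmentation parameters changed."""
    unknown = set(overrides) - SEGMENTATION_KEYS
    if unknown:
        raise KeyError(f"unknown segmentation parameters: {sorted(unknown)}")
    new_rule = copy.deepcopy(rule)  # never mutate the fetched rule
    new_rule["rules"]["segmentation"].update(overrides)
    return new_rule


def to_hierarchical(rule, parent_mode, subchunk_segmentation):
    """Return a copy of the rule switched to parent-child mode."""
    if parent_mode not in ("full-doc", "paragraph"):
        raise ValueError(f"unsupported parent_mode: {parent_mode!r}")
    new_rule = copy.deepcopy(rule)
    new_rule["mode"] = "hierarchical"
    new_rule["rules"]["parent_mode"] = parent_mode
    new_rule["rules"]["subchunk_segmentation"] = dict(subchunk_segmentation)
    return new_rule


# Fine-tune for short conversational transcripts, then enable parent-child chunking.
tuned = with_segmentation(FETCHED_RULE, separator="\n", max_tokens=256)
hier = to_hierarchical(tuned, "paragraph", {"separator": "\n", "max_tokens": 128})
```

Working on deep copies keeps the fetched rule intact, so the same baseline can be reused for several related documents.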

Theoretical Basis

  • Fixed-Size vs. Semantic Chunking -- Dify adopts a separator-plus-token-limit approach: documents are first split at natural boundaries (e.g., double newlines), and any resulting piece is then capped at a maximum token count. This hybrid strategy preserves paragraph structure while keeping every chunk within a bounded size for the embedding model.
  • Chunk Overlap -- The optional chunk_overlap parameter creates redundancy at chunk boundaries, reducing the risk of splitting critical information across two chunks. This is a well-established technique in information retrieval to improve recall at segment edges.
  • Pre-Processing Rules -- Before segmentation, optional rules (e.g., remove_extra_spaces, remove_urls_emails) clean the raw text. These transformations improve embedding quality by removing noise that could distort vector representations.
  • Parent-Child Indexing -- The hierarchical model implements a coarse-to-fine retrieval pattern. Parent chunks (full document or paragraph level) are stored alongside finer-grained child chunks. At query time, child chunks provide precise matches while their parent chunks supply broader context for the language model.
  • Process Mode Distinction -- The ProcessMode enum distinguishes between custom (user-defined rules) and hierarchical (parent-child mode), with the latter unlocking additional configuration fields like parent_mode and subchunk_segmentation.
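
The ideas above (cleanup, separator-plus-token-limit splitting, overlap, and parent-child indexing) can be illustrated with a toy sketch. It is not Dify's splitter: tokens are approximated by whitespace-delimited words, small pieces are not merged, and every function name and default value here is hypothetical.

```python
import re


def remove_extra_spaces(text):
    """Collapse runs of spaces/tabs and runs of 3+ newlines (illustrative cleanup)."""
    text = re.sub(r"[ \t]+", " ", text)
    return re.sub(r"\n{3,}", "\n\n", text)


def split_chunks(text, separator="\n\n", max_tokens=8, chunk_overlap=2):
    """Split at the separator, then cap each piece at max_tokens with overlap.

    A "token" is a whitespace-delimited word, standing in for a real tokenizer.
    """
    if not 0 <= chunk_overlap < max_tokens:
        raise ValueError("chunk_overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - chunk_overlap  # how far the window advances each time
    chunks = []
    for piece in text.split(separator):
        tokens = piece.split()
        if not tokens:
            continue  # skip empty pieces between consecutive separators
        start = 0
        while True:
            chunks.append(" ".join(tokens[start:start + max_tokens]))
            if start + max_tokens >= len(tokens):
                break  # this window reached the end of the piece
            start += step
    return chunks


def parent_child(text, parent_separator="\n\n", child_max_tokens=4):
    """Paragraph-level parents, each paired with its smaller child chunks."""
    index = []
    for parent in text.split(parent_separator):
        parent = parent.strip()
        if parent:
            children = split_chunks(
                parent, separator="\n", max_tokens=child_max_tokens, chunk_overlap=0
            )
            index.append({"parent": parent, "children": children})
    return index
```

With max_tokens=4 and chunk_overlap=1, the ten-word paragraph "a b c d e f g h i j" yields "a b c d", "d e f g", and "g h i j": each chunk repeats the last word of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk. In the parent-child index, a match on a child can be traced back to its parent entry to supply wider context.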
