Principle:Unstructured IO Unstructured Section Aware Chunking

From Leeroopedia
Knowledge Sources
Domains: Document_Processing, RAG, Text_Splitting
Last Updated: 2026-02-12 00:00 GMT

Overview

A structure-aware text-splitting strategy that respects document section boundaries by starting a new chunk at each title element, preserving topical coherence within each chunk.

Description

Section-aware chunking (also called "by-title" chunking) improves on basic chunking by using the document's own structure to determine chunk boundaries. When a Title element is encountered, it starts a new chunk, with two size-related exceptions: a section that is still below the minimum size is carried forward and merged with the next one, and a section that would exceed the maximum chunk size is split even without a title. As a result, chunks align with the document's logical sections and related content stays together.

This approach produces chunks with higher topical coherence, which improves retrieval quality in RAG (Retrieval-Augmented Generation) pipelines. A chunk about "Financial Results" does not bleed into a chunk about "Risk Factors", because the title boundary enforces separation (unless a section is small enough to be deliberately combined with its neighbor).
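To make the "bleed" concrete, the following minimal sketch contrasts size-only greedy filling with title-triggered splitting on a four-element document. The `(kind, text)` tuples and the names `greedy_chunks` and `by_title_chunks` are hypothetical illustrations, not part of the Unstructured API, and elements are joined with a single space purely for readability.

```python
from typing import List, Tuple

# Hypothetical stand-in for parsed elements: (kind, text) pairs.
Element = Tuple[str, str]


def greedy_chunks(elements: List[Element], max_chars: int) -> List[str]:
    """Size-only chunking: close a chunk only when the next element would overflow it."""
    chunks: List[str] = []
    current: List[str] = []
    for _, text in elements:
        if current and len(" ".join(current + [text])) > max_chars:
            chunks.append(" ".join(current))
            current = []
        current.append(text)
    if current:
        chunks.append(" ".join(current))
    return chunks


def by_title_chunks(elements: List[Element], max_chars: int) -> List[str]:
    """Same fill rule, but a Title element also closes the chunk in progress."""
    chunks: List[str] = []
    current: List[str] = []
    for kind, text in elements:
        overflow = bool(current) and len(" ".join(current + [text])) > max_chars
        if current and (kind == "Title" or overflow):
            chunks.append(" ".join(current))
            current = []
        current.append(text)
    if current:
        chunks.append(" ".join(current))
    return chunks


doc = [
    ("Title", "Financial Results"),
    ("Text", "Revenue grew."),
    ("Title", "Risk Factors"),
    ("Text", "Rates may rise."),
]

greedy = greedy_chunks(doc, 50)
# ['Financial Results Revenue grew. Risk Factors', 'Rates may rise.']
titled = by_title_chunks(doc, 50)
# ['Financial Results Revenue grew.', 'Risk Factors Rates may rise.']
```

With the size-only rule, the "Risk Factors" heading ends up at the tail of the financial chunk and is separated from its own body text. The title rule keeps each heading with the content it introduces.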

Additional features include combining small sections that fall below a minimum size threshold, controlling whether chunks can span page boundaries, and the same overlap/size controls available in basic chunking.

Usage

Use this principle when processing structured documents (reports, papers, manuals) where section boundaries carry semantic meaning. It is generally the better fit for RAG applications whose retrieval quality depends on topical coherence within chunks. Prefer basic chunking only when documents lack clear section structure.

Theoretical Basis

Section-aware chunking extends the basic greedy fill algorithm with a structural boundary rule:

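# Abstract by-title chunking algorithm
# (merge, get_overlap, and size_of are abstract helpers, not library functions)
chunks = []
current_section = []
current_size = 0
current_page = None

for element in elements:
    is_boundary = isinstance(element, Title)
    is_page_boundary = (not multipage_sections and
                        current_page is not None and
                        element.page != current_page)
    current_page = element.page

    # Structural rule: close the section at a boundary,
    # unless it is still too small to stand on its own.
    if (is_boundary or is_page_boundary) and current_section:
        if current_size >= combine_text_under_n_chars:
            chunks.append(merge(current_section))
            current_section = []
            current_size = 0

    # Size rule inherited from basic chunking: flush before exceeding soft_max.
    if current_size + len(str(element)) > soft_max and current_section:
        chunks.append(merge(current_section))
        current_section = get_overlap(current_section, overlap)
        current_size = size_of(current_section)

    current_section.append(element)
    current_size += len(str(element))

# Emit whatever remains after the last element.
if current_section:
    chunks.append(merge(current_section))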

Additional parameters:

  • combine_text_under_n_chars: Minimum section size. Sections smaller than this are merged with the next section rather than emitted as a tiny chunk.
  • multipage_sections: Whether sections can span page boundaries (default True). Set to False to force page-aligned chunks.
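As a rough illustration of how these two parameters change the output, here is a runnable sketch of the algorithm above under simplifying assumptions. The `Element` dataclass and `chunk_by_title_sketch` are hypothetical names invented for this example rather than the library's API. Overlap handling is omitted for brevity, and sizes count element text only, ignoring the newline used to join elements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical minimal element; real element classes carry far more metadata."""
    text: str
    is_title: bool = False
    page: int = 1


def chunk_by_title_sketch(
    elements: List[Element],
    soft_max: int = 500,
    combine_text_under_n_chars: int = 0,
    multipage_sections: bool = True,
) -> List[str]:
    chunks: List[str] = []
    section: List[str] = []
    size = 0
    page: Optional[int] = None

    def flush() -> None:
        # Emit the section in progress (if any) and start a fresh one.
        nonlocal section, size
        if section:
            chunks.append("\n".join(section))
        section, size = [], 0

    for el in elements:
        page_break = (not multipage_sections) and page is not None and el.page != page
        page = el.page
        # Structural rule: close the section unless it is still below the minimum size.
        if (el.is_title or page_break) and size >= combine_text_under_n_chars:
            flush()
        # Size rule: close the section before it would exceed soft_max.
        if section and size + len(el.text) > soft_max:
            flush()
        section.append(el.text)
        size += len(el.text)
    flush()
    return chunks


elements = [
    Element("Intro", is_title=True, page=1),
    Element("Short note.", page=1),
    Element("Methods", is_title=True, page=1),
    Element("We sampled ten sites.", page=1),
    Element("Sampling continued in May.", page=2),
]

default_chunks = chunk_by_title_sketch(elements)
# 2 chunks: one per titled section
combined_chunks = chunk_by_title_sketch(elements, combine_text_under_n_chars=50)
# 1 chunk: the 16-character "Intro" section is too small to stand alone
paged_chunks = chunk_by_title_sketch(elements, multipage_sections=False)
# 3 chunks: the "Methods" section is also cut at the page change
```

Note that in this sketch, as in the abstract algorithm, a section below the minimum size is carried across any boundary, including a page boundary.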
