Principle:Unstructured IO Unstructured Section Aware Chunking

Knowledge Sources	Unstructured, Unstructured Docs, Chunking for RAG
Domains	Document_Processing, RAG, Text_Splitting
Last Updated	2026-02-12 00:00 GMT

Overview

A structure-aware text splitting strategy that respects document section boundaries by starting new chunks at title elements, preserving topical coherence within each chunk.

Description

Section-aware chunking (also called "by-title" chunking) improves upon basic chunking by using the document's own structure to determine chunk boundaries. A Title element starts a new chunk, with two size-related exceptions: a section that is still below the minimum size threshold is combined with the section that follows, and a section that exceeds the maximum chunk size is split further. This keeps chunks aligned with the document's logical sections, so related content stays together.

This approach produces chunks with higher topical coherence, which improves retrieval quality in RAG (Retrieval-Augmented Generation) pipelines. A chunk about "Financial Results" will not bleed into a chunk about "Risk Factors" because the title boundary enforces separation.
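To make the "bleed" concrete, the following minimal sketch compares a size-only greedy fill with the same fill plus a title boundary rule. It is illustrative only and does not use the Unstructured library: the `El` class and the functions `basic_chunks` and `by_title_chunks` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class El:
    """Hypothetical stand-in for a parsed document element."""
    kind: str  # "Title" or "Text"
    text: str


def basic_chunks(elements, max_chars):
    """Greedy fill: pack elements until the size limit would be exceeded."""
    chunks, current = [], []
    for el in elements:
        size = sum(len(e.text) for e in current)
        if current and size + len(el.text) > max_chars:
            chunks.append(" ".join(e.text for e in current))
            current = []
        current.append(el)
    if current:
        chunks.append(" ".join(e.text for e in current))
    return chunks


def by_title_chunks(elements, max_chars):
    """Same greedy fill, but a Title always starts a new chunk."""
    chunks, current = [], []
    for el in elements:
        size = sum(len(e.text) for e in current)
        if current and (el.kind == "Title" or size + len(el.text) > max_chars):
            chunks.append(" ".join(e.text for e in current))
            current = []
        current.append(el)
    if current:
        chunks.append(" ".join(e.text for e in current))
    return chunks


doc = [
    El("Title", "Financial Results"),
    El("Text", "Revenue grew this year."),
    El("Title", "Risk Factors"),
    El("Text", "Currency swings may hurt margins."),
]

basic = basic_chunks(doc, max_chars=60)
titled = by_title_chunks(doc, max_chars=60)
```

With the size-only fill, the "Risk Factors" heading is packed onto the end of the financial chunk and separated from its own body text. With the title rule, each heading stays with the content it introduces.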

Additional features include combining small sections that fall below a minimum size threshold, controlling whether chunks can span page boundaries, and the same overlap/size controls available in basic chunking.

Usage

Use this principle when processing structured documents (reports, papers, manuals) where section boundaries carry semantic meaning. It is the recommended chunking strategy for RAG applications where retrieval quality depends on topical coherence within chunks. Prefer basic chunking only when documents lack clear section structure.

Theoretical Basis

Section-aware chunking extends the basic greedy fill algorithm with a structural boundary rule:

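# Abstract by-title chunking algorithm (pseudocode).
# merge, get_overlap and size_of are unspecified helpers.
chunks = []
current_section = []
current_size = 0
current_page = None

for element in elements:
    is_boundary = isinstance(element, Title)
    is_page_boundary = (not multipage_sections and
                        current_page is not None and
                        element.page != current_page)

    # Structural rule: close the section at a title or page break,
    # unless it is still too small to stand on its own.
    if (is_boundary or is_page_boundary) and current_section:
        if current_size >= combine_text_under_n_chars:
            chunks.append(merge(current_section))
            current_section = []
            current_size = 0

    # Size rule (as in basic chunking): close the section when the
    # next element would push it past the soft maximum.
    if current_size + len(str(element)) > soft_max and current_section:
        chunks.append(merge(current_section))
        current_section = get_overlap(current_section, overlap)
        current_size = size_of(current_section)

    current_section.append(element)
    current_size += len(str(element))
    current_page = element.page

# Emit whatever remains after the last element.
if current_section:
    chunks.append(merge(current_section))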
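The algorithm above can be sketched as runnable Python. This is a simplified illustration under stated assumptions, not the library's implementation: the `Element` dataclass, the `chunk_by_title_sketch` function, and the character-tail overlap are all hypothetical choices made here, and sizes ignore the newline separators added when a section is merged.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical minimal element; real parsers carry more metadata."""
    text: str
    page: int = 1
    is_title: bool = False


def merge(section):
    """Join the texts of a section into one chunk string."""
    return "\n".join(e.text for e in section)


def chunk_by_title_sketch(elements, soft_max=500, combine_under=0,
                          multipage=True, overlap=0):
    chunks, section, size, page = [], [], 0, None
    for el in elements:
        page_break = not multipage and page is not None and el.page != page
        # Structural rule: close the section at a title or page break,
        # unless it is still smaller than the combine threshold.
        if (el.is_title or page_break) and section and size >= combine_under:
            chunks.append(merge(section))
            section, size = [], 0
        # Size rule: close the section when the next element would overflow.
        if section and size + len(el.text) > soft_max:
            text = merge(section)
            chunks.append(text)
            # Assumption: overlap is the last `overlap` characters of the
            # chunk just closed, carried forward as a synthetic element.
            tail = text[-overlap:] if overlap > 0 else ""
            section = [Element(tail, el.page)] if tail else []
            size = len(tail)
        section.append(el)
        size += len(el.text)
        page = el.page
    if section:  # flush the final section
        chunks.append(merge(section))
    return chunks


elements = [
    Element("Intro", is_title=True),
    Element("Short note."),
    Element("Methods", is_title=True),
    Element("We sampled 40 sites."),
    Element("Results", page=2, is_title=True),
    Element("Yield rose at most sites."),
]

per_title = chunk_by_title_sketch(elements, soft_max=200, combine_under=0)
combined = chunk_by_title_sketch(elements, soft_max=200, combine_under=20)
```

With `combine_under=0`, every title opens its own chunk. With `combine_under=20`, the 16-character "Intro" section is too small to close at the "Methods" title, so the two sections are emitted together.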

Additional parameters:

combine_text_under_n_chars: Minimum section size. Sections smaller than this are merged with the next section rather than emitted as a tiny chunk.
multipage_sections: Whether sections can span page boundaries (default True). Set to False to force page-aligned chunks.
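The effect of these two parameters can also be viewed as two separate stages: first split into sections, then merge undersized ones forward. The sketch below takes that view for illustration only. The functions `split_sections` and `combine_small` and the `(text, page, is_title)` tuple layout are assumptions made here, not part of any library API, and emitting a trailing undersized section as-is is a choice of this sketch.

```python
def split_sections(elements, multipage_sections=True):
    """Group (text, page, is_title) tuples into sections.

    A new section starts at every title and, when multipage_sections
    is False, at every page change as well.
    """
    sections, page = [], None
    for text, el_page, is_title in elements:
        page_break = (not multipage_sections
                      and page is not None and el_page != page)
        if not sections or is_title or page_break:
            sections.append([])
        sections[-1].append(text)
        page = el_page
    return sections


def combine_small(sections, combine_text_under_n_chars):
    """Merge each undersized section into the section that follows it."""
    combined, pending = [], []
    for section in sections:
        pending.extend(section)
        if sum(len(t) for t in pending) >= combine_text_under_n_chars:
            combined.append(pending)
            pending = []
    if pending:  # a trailing small section has nothing to merge into
        combined.append(pending)
    return combined


elements = [
    ("Summary", 1, True),
    ("Brief.", 1, False),
    ("Findings", 1, True),
    ("The first half of the findings.", 1, False),
    ("The second half, on the next page.", 2, False),
]

spanning = split_sections(elements, multipage_sections=True)
page_aligned = split_sections(elements, multipage_sections=False)
merged = combine_small(spanning, combine_text_under_n_chars=30)
```

Here `spanning` holds two sections, with "Findings" running across both pages, while `page_aligned` holds three because the page change forces a break. In `merged`, the 13-character "Summary" section falls under the threshold and is folded into "Findings".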

Related Pages

Implemented By

Implementation:Unstructured_IO_Unstructured_Chunk_By_Title

Uses Heuristic

Heuristic:Unstructured_IO_Unstructured_Chunk_Size_Tuning
