Principle:Unstructured IO Unstructured Section Aware Chunking
| Knowledge Sources | |
|---|---|
| Domains | Document_Processing, RAG, Text_Splitting |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A structure-aware text splitting strategy that respects document section boundaries by starting new chunks at title elements, preserving topical coherence within each chunk.
Description
Section-aware chunking (also called "by-title" chunking) improves upon basic chunking by using the document's own structure to determine chunk boundaries. When a Title element is encountered, it triggers the start of a new chunk (subject to size constraints). This ensures that chunks align with the document's logical sections, keeping related content together.
This approach produces chunks with higher topical coherence, which improves retrieval quality in RAG (Retrieval-Augmented Generation) pipelines. A chunk about "Financial Results" will not bleed into a chunk about "Risk Factors" because the title boundary enforces separation, unless a section is small enough to be deliberately combined with its neighbor.
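The boundary rule on its own, ignoring size limits and the other options, can be illustrated with a minimal sketch. The `Element` class and `split_at_titles` function are hypothetical stand-ins written for this example; they are not part of any library API.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    category: str  # e.g. "Title" or "NarrativeText"
    text: str


def split_at_titles(elements):
    """Group element texts into sections, starting a new one at each Title."""
    sections = []
    for element in elements:
        # Open a new section at every Title, and also for leading
        # content that appears before the first Title.
        if element.category == "Title" or not sections:
            sections.append([])
        sections[-1].append(element.text)
    return ["\n\n".join(section) for section in sections]


elements = [
    Element("Title", "Financial Results"),
    Element("NarrativeText", "Revenue grew 12% year over year."),
    Element("Title", "Risk Factors"),
    Element("NarrativeText", "Currency fluctuations may affect margins."),
]

chunks = split_at_titles(elements)
for chunk in chunks:
    print(repr(chunk))
```

Each resulting chunk starts with its own title, so text from "Risk Factors" never ends up in the "Financial Results" chunk.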
Additional features include combining sections that fall below a minimum size threshold and controlling whether chunks can span page boundaries. The overlap and size controls from basic chunking remain available.
Usage
Use this principle when processing structured documents (reports, papers, manuals) where section boundaries carry semantic meaning. It is the recommended chunking strategy for RAG applications where retrieval quality depends on topical coherence within chunks. Prefer basic chunking only when documents lack clear section structure.
Theoretical Basis
Section-aware chunking extends the basic greedy fill algorithm with a structural boundary rule:
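    # Abstract by-title chunking algorithm (pseudocode)
    chunks = []
    current_section = []
    current_size = 0
    current_page = None
    for element in elements:
        is_boundary = isinstance(element, Title)
        is_page_boundary = (not multipage_sections and
                            current_page is not None and
                            element.page != current_page)
        current_page = element.page
        if (is_boundary or is_page_boundary) and current_section:
            # Close the section only if it is large enough; otherwise keep
            # accumulating so that small sections are combined with the next.
            if current_size >= combine_text_under_n_chars:
                chunks.append(merge(current_section))
                current_section = []
                current_size = 0
        # Size rule inherited from basic chunking: start a new chunk
        # (carrying the overlap) when this element would exceed soft_max.
        if current_size + len(str(element)) > soft_max and current_section:
            chunks.append(merge(current_section))
            current_section = get_overlap(current_section, overlap)
            current_size = size_of(current_section)
        current_section.append(element)
        current_size += len(str(element))
    # Emit whatever remains after the last element.
    if current_section:
        chunks.append(merge(current_section))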
Additional parameters:
- combine_text_under_n_chars: Minimum section size. Sections smaller than this are merged with the next section rather than emitted as a tiny chunk.
- multipage_sections: Whether sections can span page boundaries (default True). Set to False to force page-aligned chunks.
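
To see how these parameters interact, the following is a small runnable sketch of the algorithm above. It is an illustration under stated assumptions, not the library's implementation: `Element` and `chunk_by_title_sketch` are hypothetical names, element size is measured as plain text length, and overlap handling is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    category: str  # e.g. "Title" or "NarrativeText"
    text: str
    page: int = 1


def chunk_by_title_sketch(elements, soft_max=500,
                          combine_text_under_n_chars=0,
                          multipage_sections=True):
    """Greedy fill with title and optional page boundaries (no overlap)."""
    chunks, current, size, page = [], [], 0, None

    def flush():
        # Emit the accumulated elements as one chunk and reset the state.
        nonlocal current, size
        if current:
            chunks.append("\n\n".join(e.text for e in current))
        current, size = [], 0

    for el in elements:
        is_title = el.category == "Title"
        is_page_break = (not multipage_sections
                         and page is not None and el.page != page)
        page = el.page
        # Structural boundary: close the section unless it is still too small.
        if (is_title or is_page_break) and current:
            if size >= combine_text_under_n_chars:
                flush()
        # Size boundary: close the chunk if this element would overflow it.
        if current and size + len(el.text) > soft_max:
            flush()
        current.append(el)
        size += len(el.text)
    flush()  # emit the final partial chunk
    return chunks


elements = [
    Element("Title", "Summary", page=1),
    Element("NarrativeText", "Short note.", page=1),
    Element("Title", "Financial Results", page=1),
    Element("NarrativeText", "Revenue grew 12% year over year.", page=1),
    Element("NarrativeText", "Operating costs were flat.", page=2),
]

default = chunk_by_title_sketch(elements)
combined = chunk_by_title_sketch(elements, combine_text_under_n_chars=50)
paged = chunk_by_title_sketch(elements, multipage_sections=False)
print(len(default), len(combined), len(paged))  # 2 1 3
```

With the defaults, each title starts its own chunk. Raising `combine_text_under_n_chars` to 50 keeps the 18-character "Summary" section open so it merges with what follows, and setting `multipage_sections=False` splits the second section where the page number changes.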