Implementation:Unstructured IO Unstructured Chunk By Title
| Knowledge Sources | |
|---|---|
| Domains | Document_Processing, RAG, Text_Splitting |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A concrete tool, provided by the Unstructured library, for section-aware chunking of document elements at title boundaries.
Description
The chunk_by_title function implements section-aware chunking. It starts a new chunk whenever it encounters a Title element, so chunks respect both structural boundaries and size constraints. It can merge undersized sections, control whether chunks may span pages, and accepts the same size and overlap parameters as the basic chunking strategy.
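The behavior described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the `Elem` class and the helpers `split_into_sections`, `section_length`, and `combine_small_sections` are hypothetical names introduced here, and the merging rule is a simplified assumption about how undersized sections are combined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Elem:
    """Hypothetical stand-in for a partitioned element (not the library's Element)."""
    category: str                      # e.g. "Title" or "NarrativeText"
    text: str
    page_number: Optional[int] = None


def split_into_sections(elements: List[Elem], multipage_sections: bool = True) -> List[List[Elem]]:
    """Group elements into sections, opening a new section at each Title.

    When multipage_sections is False, a change of page number also opens
    a new section, so no section spans a page boundary.
    """
    sections: List[List[Elem]] = []
    current: List[Elem] = []
    for el in elements:
        page_break = (
            not multipage_sections
            and bool(current)
            and el.page_number != current[-1].page_number
        )
        if current and (el.category == "Title" or page_break):
            sections.append(current)
            current = []
        current.append(el)
    if current:
        sections.append(current)
    return sections


def section_length(section: List[Elem]) -> int:
    """Length of the section text when elements are joined by a blank line."""
    return sum(len(e.text) for e in section) + 2 * (len(section) - 1)


def combine_small_sections(
    sections: List[List[Elem]],
    combine_text_under_n_chars: int,
    max_characters: int,
) -> List[List[Elem]]:
    """Merge a section into the previous chunk while that chunk is still
    under the threshold and the merged result fits within max_characters.
    (Simplified assumption, labeled as such.)"""
    merged: List[List[Elem]] = []
    for sec in sections:
        if (
            merged
            and section_length(merged[-1]) < combine_text_under_n_chars
            and section_length(merged[-1]) + 2 + section_length(sec) <= max_characters
        ):
            merged[-1] = merged[-1] + sec
        else:
            merged.append(sec)
    return merged


if __name__ == "__main__":
    demo = [
        Elem("Title", "Intro", 1),
        Elem("NarrativeText", "Short intro.", 1),
        Elem("Title", "Methods", 1),
        Elem("NarrativeText", "We did things.", 2),
    ]
    for section in split_into_sections(demo, multipage_sections=False):
        print([e.text for e in section])
```

Keeping boundary detection separate from merging mirrors the two concerns named in the description: where a section starts, and when a section is too small to stand alone.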
Usage
Import this function when you need chunks that respect document section structure. This is the recommended chunking strategy for RAG pipelines processing structured documents like reports, papers, and manuals where topical coherence matters for retrieval quality.
Code Reference
Source Location
- Repository: unstructured
- File: unstructured/chunking/title.py
- Lines: 23-99
Signature
def chunk_by_title(
    elements: Iterable[Element],
    *,
    combine_text_under_n_chars: Optional[int] = None,
    include_orig_elements: Optional[bool] = None,
    max_characters: Optional[int] = None,
    max_tokens: Optional[int] = None,
    multipage_sections: Optional[bool] = None,
    new_after_n_chars: Optional[int] = None,
    new_after_n_tokens: Optional[int] = None,
    overlap: Optional[int] = None,
    overlap_all: Optional[bool] = None,
    tokenizer: Optional[str] = None,
) -> list[Element]:
    """Chunk elements at title boundaries with size constraints.

    Args:
        elements: Iterable of Element objects to chunk.
        combine_text_under_n_chars: Merge sections smaller than this threshold.
        include_orig_elements: Preserve original elements in chunk metadata.
        max_characters: Hard maximum chunk size in characters (default 500).
        max_tokens: Hard maximum chunk size in tokens.
        multipage_sections: Allow chunks to span page boundaries (default True).
        new_after_n_chars: Soft max to trigger new chunk.
        new_after_n_tokens: Soft max in tokens.
        overlap: Character overlap between consecutive chunks.
        overlap_all: Apply overlap to all chunks.
        tokenizer: Tokenizer name for token-based chunking.

    Returns:
        List of chunked elements respecting section boundaries.
    """
Import
from unstructured.chunking.title import chunk_by_title
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| elements | Iterable[Element] | Yes | Elements from partitioning |
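| max_characters | Optional[int] | No | Hard max chunk size in characters (default 500) |
| max_tokens | Optional[int] | No | Hard max chunk size in tokens |
| new_after_n_chars | Optional[int] | No | Soft max; start a new chunk once this size is reached |
| new_after_n_tokens | Optional[int] | No | Soft max in tokens |
| combine_text_under_n_chars | Optional[int] | No | Merge small sections below this threshold |
| multipage_sections | Optional[bool] | No | Allow cross-page chunks (default True) |
| overlap | Optional[int] | No | Character overlap between chunks |
| overlap_all | Optional[bool] | No | Apply overlap to all chunks |
| tokenizer | Optional[str] | No | Tokenizer name for token-based chunking |
| include_orig_elements | Optional[bool] | No | Store original elements in metadata |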
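How the hard limit (max_characters), the soft limit (new_after_n_chars), and overlap interact can be sketched in plain Python. This is an illustrative approximation under stated assumptions, not the library's code: `pack_texts` and `split_oversized` are hypothetical helpers, oversized text is cut at fixed character offsets rather than at word boundaries, and overlap is applied only between the pieces of a split text.

```python
from typing import List, Optional


def split_oversized(text: str, max_characters: int, overlap: int = 0) -> List[str]:
    """Split one oversized text into pieces of at most max_characters.

    Each piece after the first begins with the last `overlap` characters
    of the piece before it.
    """
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be non-negative and smaller than max_characters")
    pieces: List[str] = []
    start = 0
    while start < len(text):
        end = start + max_characters
        pieces.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap  # step back so the next piece repeats the tail
    return pieces


def pack_texts(
    texts: List[str],
    max_characters: int = 500,
    new_after_n_chars: Optional[int] = None,
    overlap: int = 0,
) -> List[str]:
    """Greedily pack element texts into chunks.

    A chunk is closed when it has reached the soft limit, or when adding
    the next text would exceed the hard limit. The soft limit falls back
    to the hard limit when it is not given (an assumption of this sketch).
    """
    soft = max_characters if new_after_n_chars is None else new_after_n_chars
    chunks: List[str] = []
    current = ""
    for text in texts:
        if len(text) > max_characters:
            # A single text that cannot fit is isolated and split on its own.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(split_oversized(text, max_characters, overlap))
            continue
        candidate = text if not current else current + "\n\n" + text
        if current and (len(candidate) > max_characters or len(current) >= soft):
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(pack_texts(["aaaa", "bbbb", "cccc"], max_characters=20, new_after_n_chars=8))
    print(pack_texts(["abcdefghij"], max_characters=4, overlap=1))
```

In this sketch, setting the soft limit below the hard limit leaves headroom: chunks tend to close early at natural element boundaries, and the hard limit only forces a split when a single element is too large.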
Outputs
| Name | Type | Description |
|---|---|---|
| return | list[Element] | Chunked elements aligned to section boundaries: CompositeElement for text, TableChunk for split tables |
Usage Examples
Section-Aware Chunking for RAG
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Partition the PDF into elements (Title, NarrativeText, Table, ...)
elements = partition(filename="annual_report.pdf", strategy="hi_res")

# Chunk at section boundaries with a soft limit below the hard limit
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1200,
    combine_text_under_n_chars=200,
    overlap=100,
)

for chunk in chunks:
    print(f"Length: {len(str(chunk))}, Text: {str(chunk)[:60]}...")
Page-Aligned Chunks
from unstructured.chunking.title import chunk_by_title

# `elements` is the output of partition(), as in the previous example.
# Start a new chunk at every page break so that no chunk spans pages.
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    multipage_sections=False,
)
Via Dispatch Function
from unstructured.chunking.dispatch import chunk

chunks = chunk(
    elements,
    chunking_strategy="by_title",
    max_characters=1000,
    combine_text_under_n_chars=200,
    include_orig_elements=True,
)