Implementation:Unstructured IO Unstructured Chunk By Title

From Leeroopedia
Knowledge Sources
Domains Document_Processing, RAG, Text_Splitting
Last Updated 2026-02-12 00:00 GMT

Overview

A concrete tool, provided by the Unstructured library, for section-aware chunking of document elements at title boundaries.

Description

The chunk_by_title function implements section-aware chunking. It starts a new chunk whenever a Title element is encountered, so chunks respect structural boundaries as well as size constraints. It supports merging undersized sections (combine_text_under_n_chars), controlling whether a section may span pages (multipage_sections), and the same size and overlap parameters as basic chunking.
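The boundary rule described above can be sketched in plain Python. This is a simplified model, not the library's implementation: `Elem` and `chunk_at_titles` are hypothetical names, elements are reduced to a kind and a text, and an element longer than the limit is kept whole rather than split.

```python
from dataclasses import dataclass


@dataclass
class Elem:
    """Hypothetical stand-in for a document element."""
    kind: str  # "Title" or "Text"
    text: str


def chunk_at_titles(elems: list[Elem], max_characters: int = 500) -> list[str]:
    """Group element texts into chunks, starting a new chunk at each Title
    or when adding the next element would exceed max_characters."""
    chunks: list[str] = []
    current: list[str] = []

    def flush() -> None:
        # Emit the accumulated texts as one chunk, if there are any.
        if current:
            chunks.append("\n\n".join(current))
            current.clear()

    for el in elems:
        candidate_len = len("\n\n".join(current + [el.text]))
        if el.kind == "Title" or candidate_len > max_characters:
            flush()
        current.append(el.text)
    flush()
    return chunks
```

With a Title, a short paragraph, a second Title, and two paragraphs that together exceed the limit, the sketch closes one chunk at the second Title and another when the size limit is reached.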

Usage

Import this function when you need chunks that respect document section structure. It suits RAG pipelines that process structured documents such as reports, papers, and manuals, where topical coherence matters for retrieval quality.

Code Reference

Source Location

  • Repository: unstructured
  • File: unstructured/chunking/title.py
  • Lines: 23-99

Signature

def chunk_by_title(
    elements: Iterable[Element],
    *,
    combine_text_under_n_chars: Optional[int] = None,
    include_orig_elements: Optional[bool] = None,
    max_characters: Optional[int] = None,
    max_tokens: Optional[int] = None,
    multipage_sections: Optional[bool] = None,
    new_after_n_chars: Optional[int] = None,
    new_after_n_tokens: Optional[int] = None,
    overlap: Optional[int] = None,
    overlap_all: Optional[bool] = None,
    tokenizer: Optional[str] = None,
) -> list[Element]:
    """Chunk elements at title boundaries with size constraints.

    Args:
        elements: Iterable of Element objects to chunk.
        combine_text_under_n_chars: Merge sections smaller than this threshold.
        include_orig_elements: Preserve original elements in chunk metadata.
        max_characters: Hard maximum chunk size in characters (default 500).
        max_tokens: Hard maximum chunk size in tokens.
        multipage_sections: Allow chunks to span page boundaries (default True).
        new_after_n_chars: Soft max to trigger new chunk.
        new_after_n_tokens: Soft max in tokens.
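        overlap: Number of characters carried from the end of one chunk to the start of the next; by default only between chunks produced by splitting an oversized element.
        overlap_all: Also apply overlap between chunks formed from whole elements (default False).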
        tokenizer: Tokenizer name for token-based chunking.
    Returns:
        List of chunked elements respecting section boundaries.
    """

Import

from unstructured.chunking.title import chunk_by_title

I/O Contract

Inputs

Name Type Required Description
elements Iterable[Element] Yes Elements from partitioning
max_characters Optional[int] No Hard max chunk size (default 500)
new_after_n_chars Optional[int] No Soft max to start new chunk
combine_text_under_n_chars Optional[int] No Merge small sections below this threshold
multipage_sections Optional[bool] No Allow cross-page chunks (default True)
overlap Optional[int] No Character overlap between chunks
include_orig_elements Optional[bool] No Store original elements in metadata
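To make the interaction between combine_text_under_n_chars and max_characters concrete, here is a minimal sketch of one plausible merging rule. It is an assumption for illustration only: `combine_small_sections` is a hypothetical helper, sections are plain strings, and the library's exact rule may differ.

```python
def combine_small_sections(
    sections: list[str],
    combine_text_under_n_chars: int,
    max_characters: int,
) -> list[str]:
    """Merge a section into the previous one while the previous one is
    still under the threshold and the result fits the hard maximum."""
    merged: list[str] = []
    for section in sections:
        if merged:
            previous = merged[-1]
            combined = previous + "\n\n" + section
            if (
                len(previous) < combine_text_under_n_chars
                and len(combined) <= max_characters
            ):
                merged[-1] = combined
                continue
        merged.append(section)
    return merged
```

In this model a threshold of 0 never merges anything, and a larger threshold keeps absorbing following sections until the accumulated chunk reaches the threshold or the next merge would exceed max_characters.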

Outputs

Name Type Description
return list[Element] Chunked elements aligned to section boundaries: CompositeElement for text, TableChunk for split tables

Usage Examples

Section-Aware Chunking for RAG

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="annual_report.pdf", strategy="hi_res")

chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1200,
    combine_text_under_n_chars=200,
    overlap=100,
)

for chunk in chunks:
    print(f"Length: {len(str(chunk))}, Text: {str(chunk)[:60]}...")
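The overlap argument used above repeats the tail of one chunk at the start of the next. The sketch below shows that idea on a single long string, assuming fixed-offset splitting; `split_with_overlap` is a hypothetical helper, and the library may choose split points differently (for example at whitespace).

```python
def split_with_overlap(text: str, max_characters: int, overlap: int) -> list[str]:
    """Split text into windows of at most max_characters, where each window
    after the first begins with the last `overlap` characters of the
    previous window."""
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")

    chunks: list[str] = []
    step = max_characters - overlap  # how far each new window advances
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

A larger overlap preserves more context across chunk boundaries at the cost of storing and embedding more duplicated text.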

Page-Aligned Chunks

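from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Partition the document first, as in the previous example
elements = partition(filename="annual_report.pdf", strategy="hi_res")

# Force chunks to not span page boundaries
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    multipage_sections=False,
)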
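The effect of multipage_sections=False can be pictured with a small model in which each item carries a page number. This is an illustrative assumption, not library code: `group_by_page` is a hypothetical helper, and title and size boundaries are ignored so that only the page rule is visible.

```python
def group_by_page(
    items: list[tuple[int, str]],
    multipage_sections: bool = True,
) -> list[list[str]]:
    """Group (page_number, text) pairs; when multipage_sections is False,
    a change of page number closes the current group."""
    groups: list[list[str]] = []
    current: list[str] = []
    last_page = None
    for page, text in items:
        if not multipage_sections and current and page != last_page:
            groups.append(current)
            current = []
        current.append(text)
        last_page = page
    if current:
        groups.append(current)
    return groups
```

With the default setting, text from consecutive pages may share a group; with the flag disabled, every group contains text from exactly one page.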

Via Dispatch Function

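from unstructured.partition.auto import partition
from unstructured.chunking.dispatch import chunk

# Partition the document first, as in the first example
elements = partition(filename="annual_report.pdf", strategy="hi_res")

# Select the chunker by strategy name; remaining keyword arguments are passed to it
chunks = chunk(
    elements,
    chunking_strategy="by_title",
    max_characters=1000,
    combine_text_under_n_chars=200,
    include_orig_elements=True,
)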
