
Implementation:Unstructured IO Unstructured ChunkingOptions

From Leeroopedia
Knowledge Sources
Domains Document_Processing, RAG, Configuration
Last Updated 2026-02-12 00:00 GMT

Overview

A concrete class, provided by the Unstructured library, for validating and resolving chunking size parameters.

Description

The ChunkingOptions class accepts raw chunking parameters (max_characters, max_tokens, overlap, etc.), validates them, resolves defaults, and exposes computed properties (hard_max, soft_max, overlap, combine_text_under_n_chars). It supports both character-based and token-based sizing, with optional tiktoken integration for accurate token counting.

Usage

This class is used internally by chunk_elements and chunk_by_title. Import it directly only when you need to pre-validate chunking parameters, inspect computed size limits, or build custom chunking logic on top of the same parameter resolution infrastructure.

Code Reference

Source Location

  • Repository: unstructured
  • File: unstructured/chunking/base.py
  • Lines: 81-333

Signature

class ChunkingOptions:
    def __init__(self, **kwargs: Any):
        """Initialize chunking options from keyword arguments.

        Supported kwargs:
            max_characters (int): Hard max chunk size in characters (default 500).
            new_after_n_chars (int): Soft max to trigger new chunk.
            max_tokens (int): Hard max chunk size in tokens.
            new_after_n_tokens (int): Soft max in tokens.
            overlap (int): Character overlap between chunks.
            overlap_all (bool): Apply overlap to all chunk types.
            combine_text_under_n_chars (int): Merge small sections (by_title only).
            multipage_sections (bool): Allow cross-page chunks (by_title only).
            include_orig_elements (bool): Store source elements in metadata.
            tokenizer (str): Tokenizer name for token-based chunking.
        """

    @classmethod
    def new(cls, **kwargs: Any) -> Self:
        """Factory method that returns the appropriate subclass instance."""

Import

from unstructured.chunking.base import ChunkingOptions

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| max_characters | int | No | Hard max chunk size in characters (default 500) |
| new_after_n_chars | int | No | Soft max to trigger a new chunk |
| max_tokens | int | No | Hard max in tokens (requires tiktoken) |
| new_after_n_tokens | int | No | Soft max in tokens |
| overlap | int | No | Character overlap between chunks |
| overlap_all | bool | No | Apply overlap to all chunk types |
| combine_text_under_n_chars | int | No | Merge small sections (by_title only) |
| multipage_sections | bool | No | Allow cross-page chunks (default True) |
| include_orig_elements | bool | No | Store original elements in metadata |
| tokenizer | str | No | Tokenizer name for token counting |

Outputs (Properties)

| Name | Type | Description |
|------|------|-------------|
| hard_max | int | Resolved absolute maximum chunk size in characters |
| soft_max | int | Resolved soft maximum (≤ hard_max) |
| overlap | int | Resolved overlap in characters |
| inter_chunk_overlap | int | Computed overlap between consecutive text chunks |
| combine_text_under_n_chars | int | Resolved minimum section size for by_title |

Usage Examples

Inspect Resolved Parameters

from unstructured.chunking.base import ChunkingOptions

opts = ChunkingOptions(
    max_characters=1000,
    new_after_n_chars=800,
    overlap=100,
)

print(f"Hard max: {opts.hard_max}")   # 1000
print(f"Soft max: {opts.soft_max}")   # 800
print(f"Overlap: {opts.overlap}")     # 100

