
Implementation:Unstructured IO Unstructured ChunkingOptions

From Leeroopedia
Knowledge Sources
Domains Document_Processing, RAG, Configuration
Last Updated 2026-02-12 00:00 GMT

Overview

A concrete class, provided by the Unstructured library, for validating and resolving chunking size parameters.

Description

The ChunkingOptions class accepts raw chunking parameters (max_characters, max_tokens, overlap, etc.), validates them, resolves defaults, and exposes computed properties (hard_max, soft_max, overlap, combine_text_under_n_chars). It supports both character-based and token-based sizing, with optional tiktoken integration for accurate token counting.

Usage

This class is used internally by chunk_elements and chunk_by_title. Import it directly only when you need to pre-validate chunking parameters, inspect computed size limits, or build custom chunking logic on top of the same parameter resolution infrastructure.

Code Reference

Source Location

  • Repository: unstructured
  • File: unstructured/chunking/base.py
  • Lines: 81-333

Signature

class ChunkingOptions:
    def __init__(self, **kwargs: Any):
        """Initialize chunking options from keyword arguments.

        Supported kwargs:
            max_characters (int): Hard max chunk size in characters (default 500).
            new_after_n_chars (int): Soft max to trigger new chunk.
            max_tokens (int): Hard max chunk size in tokens.
            new_after_n_tokens (int): Soft max in tokens.
            overlap (int): Character overlap between chunks.
            overlap_all (bool): Apply overlap to all chunk types.
            combine_text_under_n_chars (int): Merge small sections (by_title only).
            multipage_sections (bool): Allow cross-page chunks (by_title only).
            include_orig_elements (bool): Store source elements in metadata.
            tokenizer (str): Tokenizer name for token-based chunking.
        """

    @classmethod
    def new(cls, **kwargs: Any) -> Self:
        """Factory method that returns the appropriate subclass instance."""

Import

from unstructured.chunking.base import ChunkingOptions

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| max_characters | int | No | Hard max chunk size in characters (default 500) |
| new_after_n_chars | int | No | Soft max to trigger a new chunk |
| max_tokens | int | No | Hard max in tokens (requires tiktoken) |
| new_after_n_tokens | int | No | Soft max in tokens |
| overlap | int | No | Character overlap between chunks |
| overlap_all | bool | No | Apply overlap to all chunk types |
| combine_text_under_n_chars | int | No | Merge small sections (by_title only) |
| multipage_sections | bool | No | Allow cross-page chunks (default True) |
| include_orig_elements | bool | No | Store original elements in metadata |
| tokenizer | str | No | Tokenizer name for token counting |

Outputs (Properties)

| Name | Type | Description |
|------|------|-------------|
| hard_max | int | Resolved absolute maximum chunk size in characters |
| soft_max | int | Resolved soft maximum (≤ hard_max) |
| overlap | int | Resolved overlap in characters |
| inter_chunk_overlap | int | Computed overlap between consecutive text chunks |
| combine_text_under_n_chars | int | Resolved minimum section size for by_title |

Usage Examples

Inspect Resolved Parameters

from unstructured.chunking.base import ChunkingOptions

opts = ChunkingOptions(
    max_characters=1000,
    new_after_n_chars=800,
    overlap=100,
)

print(f"Hard max: {opts.hard_max}")   # 1000
print(f"Soft max: {opts.soft_max}")   # 800
print(f"Overlap: {opts.overlap}")     # 100

