Implementation: Unstructured ChunkingOptions
| Knowledge Sources | |
|---|---|
| Domains | Document_Processing, RAG, Configuration |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A concrete class from the Unstructured library for validating and resolving chunking size parameters.
Description
The ChunkingOptions class accepts raw chunking parameters (max_characters, max_tokens, overlap, etc.), validates them, resolves defaults, and exposes computed properties (hard_max, soft_max, overlap, combine_text_under_n_chars). It supports both character-based and token-based sizing, with optional tiktoken integration for accurate token counting.
Usage
This class is used internally by chunk_elements and chunk_by_title. Import it directly only when you need to pre-validate chunking parameters, inspect computed size limits, or build custom chunking logic on top of the same parameter resolution infrastructure.
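The class validates parameters up front, so malformed combinations fail before chunking begins. The checks can be sketched in isolation; the `validate_chunking_params` helper below is a hypothetical illustration of the documented constraints, not the library's actual code:

```python
# A minimal standalone sketch of the kind of validation ChunkingOptions
# performs; validate_chunking_params is hypothetical, not the library's API.
from typing import Optional


def validate_chunking_params(
    max_characters: int = 500,
    new_after_n_chars: Optional[int] = None,
    overlap: int = 0,
) -> dict:
    """Validate raw chunking sizes and resolve the documented defaults."""
    if max_characters <= 0:
        raise ValueError("max_characters must be a positive integer")
    # The soft max defaults to the hard max and may never exceed it.
    soft_max = new_after_n_chars if new_after_n_chars is not None else max_characters
    if soft_max > max_characters:
        raise ValueError("new_after_n_chars may not exceed max_characters")
    # Overlap as large as the window would prevent forward progress.
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be in [0, max_characters)")
    return {"hard_max": max_characters, "soft_max": soft_max, "overlap": overlap}
```

Running such a check before calling chunk_elements or chunk_by_title surfaces configuration mistakes at startup rather than mid-pipeline.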
Code Reference
Source Location
- Repository: unstructured
- File: unstructured/chunking/base.py
- Lines: 81-333
Signature
class ChunkingOptions:
    def __init__(self, **kwargs: Any):
        """Initialize chunking options from keyword arguments.

        Supported kwargs:
            max_characters (int): Hard max chunk size in characters (default 500).
            new_after_n_chars (int): Soft max to trigger a new chunk.
            max_tokens (int): Hard max chunk size in tokens.
            new_after_n_tokens (int): Soft max in tokens.
            overlap (int): Character overlap between chunks.
            overlap_all (bool): Apply overlap to all chunk types.
            combine_text_under_n_chars (int): Merge small sections (by_title only).
            multipage_sections (bool): Allow cross-page chunks (by_title only).
            include_orig_elements (bool): Store source elements in metadata.
            tokenizer (str): Tokenizer name for token-based chunking.
        """

    @classmethod
    def new(cls, **kwargs: Any) -> Self:
        """Factory method that returns the appropriate subclass instance."""
Import
from unstructured.chunking.base import ChunkingOptions
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| max_characters | int | No | Hard max chunk size in characters (default 500) |
| new_after_n_chars | int | No | Soft max to trigger new chunk |
| max_tokens | int | No | Hard max in tokens (requires tiktoken) |
| new_after_n_tokens | int | No | Soft max in tokens |
| overlap | int | No | Character overlap between chunks |
| overlap_all | bool | No | Apply overlap to all chunk types |
| combine_text_under_n_chars | int | No | Merge small sections (by_title only) |
| multipage_sections | bool | No | Allow cross-page chunks (default True) |
| include_orig_elements | bool | No | Store original elements in metadata |
| tokenizer | str | No | Tokenizer name for token counting |
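The overlap input controls how many trailing characters of one chunk are repeated at the start of the next when oversized text must be split. A minimal sketch of that splitting behavior, assuming simple fixed-window semantics (this `split_with_overlap` helper is illustrative, not the library's actual splitter):

```python
def split_with_overlap(text: str, hard_max: int, overlap: int) -> list:
    """Split text into windows of at most hard_max characters, with each
    new window starting `overlap` characters before the previous one ended.

    Hypothetical helper illustrating overlap semantics; not library code.
    """
    assert 0 <= overlap < hard_max, "overlap must be smaller than the window"
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start : start + hard_max])
        if start + hard_max >= len(text):
            break  # last window reached the end of the text
        start = start + hard_max - overlap
    return chunks
```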
Outputs (Properties)
| Name | Type | Description |
|---|---|---|
| hard_max | int | Resolved absolute maximum chunk size in characters |
| soft_max | int | Resolved soft maximum (≤ hard_max) |
| overlap | int | Resolved overlap in characters |
| inter_chunk_overlap | int | Computed overlap between consecutive text chunks |
| combine_text_under_n_chars | int | Resolved minimum section size for by_title |
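The resolved hard_max and soft_max work together: a chunk closes once it reaches the soft max and is never allowed to exceed the hard max. A greedy packing sketch under those assumptions (the `pack_elements` function is hypothetical, a simplification of the real chunkers):

```python
def pack_elements(texts: list, hard_max: int, soft_max: int) -> list:
    """Greedily pack element texts into chunks: close a chunk once it
    reaches soft_max; never let a chunk exceed hard_max.

    Hypothetical sketch of hard/soft max interplay; not library code.
    A single element longer than hard_max would itself need splitting,
    which is omitted here.
    """
    chunks, current = [], ""
    for text in texts:
        candidate = f"{current}\n\n{text}" if current else text
        if current and (len(current) >= soft_max or len(candidate) > hard_max):
            chunks.append(current)  # close the chunk and start a new one
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```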
Usage Examples
Inspect Resolved Parameters
from unstructured.chunking.base import ChunkingOptions
opts = ChunkingOptions(
    max_characters=1000,
    new_after_n_chars=800,
    overlap=100,
)
print(f"Hard max: {opts.hard_max}") # 1000
print(f"Soft max: {opts.soft_max}") # 800
print(f"Overlap: {opts.overlap}") # 100