Principle: Unstructured IO Chunk Size Configuration
| Knowledge Sources | |
|---|---|
| Domains | Document_Processing, RAG, Configuration |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A configuration mechanism that validates and resolves chunking size parameters (character limits, token limits, overlap) into a consistent set of constraints for the chunking engine.
Description
Chunking size configuration handles the complexity of specifying chunk size constraints. Users can specify sizes in characters or tokens, set hard and soft limits, configure overlap, and choose tokenizers. These parameters interact: token-based limits must be converted to character estimates, soft limits cannot exceed hard limits, and overlap cannot exceed chunk size.
The configuration layer validates all parameters, resolves defaults, computes derived values (e.g., converting token limits to approximate character limits), and provides a clean interface for the chunking algorithms to query size constraints.
Usage
Use this principle when you need to understand how chunk size parameters interact and are validated. It is the configuration layer that sits between user-facing chunking functions (chunk_elements, chunk_by_title) and the core chunking engine. Understanding this layer is essential for tuning chunk sizes for specific embedding models or retrieval requirements.
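For example, when tuning for a specific embedding model, a character limit can be back-computed from the model's token window. The 4-characters-per-token ratio and safety margin below are rough English-text heuristics for illustration, not library constants:

```python
def chars_for_token_budget(token_limit: int,
                           avg_chars_per_token: float = 4.0,
                           safety_margin: float = 0.9) -> int:
    """Rough character budget that keeps chunks inside a model's token window."""
    return int(token_limit * avg_chars_per_token * safety_margin)


# A 512-token embedding model leaves roughly this many characters per chunk:
max_characters = chars_for_token_budget(512)  # 1843
```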
Theoretical Basis
Chunk size configuration resolves a parameter hierarchy:
- **Hard max** (`max_characters` / `max_tokens`): the absolute ceiling. No chunk will exceed this size. Elements that are individually larger than the hard max are split mid-text at word boundaries.
- **Soft max** (`new_after_n_chars` / `new_after_n_tokens`): the target size. Once a chunk reaches this size, the next element boundary triggers a new chunk. Must be ≤ the hard max.
- **Overlap**: characters from the end of one chunk are prepended to the beginning of the next. Must be < the hard max.
- **Token conversion**: when token-based limits are specified, they are converted to character estimates using the specified tokenizer's average characters-per-token ratio.
```
# Abstract configuration resolution
if max_tokens is specified:
    hard_max = max_tokens * avg_chars_per_token
elif max_characters is specified:
    hard_max = max_characters
else:
    hard_max = 500  # default

if soft_max > hard_max:
    soft_max = hard_max
if overlap >= hard_max:
    raise ValueError("overlap must be less than max_characters")
```
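This resolution can be fleshed out into runnable form. A sketch under assumptions: the function name, the default 4.0 characters-per-token ratio, and the token-limit precedence are illustrative, not the engine's actual values.

```python
def resolve_size_limits(max_characters=None, max_tokens=None,
                        new_after_n_chars=None, overlap=0,
                        avg_chars_per_token=4.0):
    """Resolve user-facing size parameters into (hard_max, soft_max, overlap)."""
    # Token limits take precedence and are converted to a character estimate.
    if max_tokens is not None:
        hard_max = int(max_tokens * avg_chars_per_token)
    elif max_characters is not None:
        hard_max = max_characters
    else:
        hard_max = 500  # default ceiling

    # The soft limit defaults to the hard limit and is clamped to it.
    if new_after_n_chars is None:
        soft_max = hard_max
    else:
        soft_max = min(new_after_n_chars, hard_max)

    # Overlap must leave room for new text in every chunk.
    if overlap >= hard_max:
        raise ValueError("overlap must be less than max_characters")
    return hard_max, soft_max, overlap
```

For example, `resolve_size_limits(max_tokens=512)` yields a hard limit of 2048 characters, while `resolve_size_limits(max_characters=500, new_after_n_chars=800)` clamps the soft limit down to 500.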