Principle: Unstructured IO Chunk Size Configuration
| Knowledge Sources | |
|---|---|
| Domains | Document_Processing, RAG, Configuration |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A configuration mechanism that validates and resolves chunking size parameters (character limits, token limits, overlap) into a consistent set of constraints for the chunking engine.
Description
Chunking size configuration handles the complexity of specifying chunk size constraints. Users can specify sizes in characters or tokens, set hard and soft limits, configure overlap, and choose tokenizers. These parameters interact: token-based limits must be converted to character estimates, soft limits cannot exceed hard limits, and overlap cannot exceed chunk size.
The configuration layer validates all parameters, resolves defaults, computes derived values (e.g., converting token limits to approximate character limits), and provides a clean interface for the chunking algorithms to query size constraints.
Usage
Use this principle when you need to understand how chunk size parameters interact and are validated. It is the configuration layer that sits between user-facing chunking functions (chunk_elements, chunk_by_title) and the core chunking engine. Understanding this layer is essential for tuning chunk sizes for specific embedding models or retrieval requirements.
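For example, when tuning for a specific embedding model, a character limit can be back-computed from the model's token window. The 4-characters-per-token ratio and safety margin below are rough English-text heuristics for illustration, not library constants:

```python
def chars_for_token_budget(token_limit: int,
                           avg_chars_per_token: float = 4.0,
                           safety_margin: float = 0.9) -> int:
    """Rough character budget that keeps chunks inside a model's token window."""
    return int(token_limit * avg_chars_per_token * safety_margin)


# A 512-token embedding model leaves roughly this many characters per chunk:
max_characters = chars_for_token_budget(512)  # 1843
```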
Theoretical Basis
Chunk size configuration resolves a parameter hierarchy:
- **Hard max** (`max_characters` / `max_tokens`): the absolute ceiling. No chunk will exceed this size. Elements that are individually larger than the hard max are split mid-text at word boundaries.
- **Soft max** (`new_after_n_chars` / `new_after_n_tokens`): the target size. Once a chunk reaches this size, the next element boundary triggers a new chunk. Must be ≤ the hard max.
- **Overlap**: characters from the end of one chunk are prepended to the beginning of the next. Must be < the hard max.
- **Token conversion**: when token-based limits are specified, they are converted to character estimates using the specified tokenizer's average characters-per-token ratio.
```
# Abstract configuration resolution
if max_tokens is specified:
    hard_max = max_tokens * avg_chars_per_token
elif max_characters is specified:
    hard_max = max_characters
else:
    hard_max = 500  # default

if soft_max > hard_max:
    soft_max = hard_max
if overlap >= hard_max:
    raise ValueError("overlap must be less than max_characters")
```
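This resolution can be fleshed out into runnable form. A sketch under assumptions: the function name, the default 4.0 characters-per-token ratio, and the token-limit precedence are illustrative, not the engine's actual values.

```python
def resolve_size_limits(max_characters=None, max_tokens=None,
                        new_after_n_chars=None, overlap=0,
                        avg_chars_per_token=4.0):
    """Resolve user-facing size parameters into (hard_max, soft_max, overlap)."""
    # Token limits take precedence and are converted to a character estimate.
    if max_tokens is not None:
        hard_max = int(max_tokens * avg_chars_per_token)
    elif max_characters is not None:
        hard_max = max_characters
    else:
        hard_max = 500  # default ceiling

    # The soft limit defaults to the hard limit and is clamped to it.
    if new_after_n_chars is None:
        soft_max = hard_max
    else:
        soft_max = min(new_after_n_chars, hard_max)

    # Overlap must leave room for new text in every chunk.
    if overlap >= hard_max:
        raise ValueError("overlap must be less than max_characters")
    return hard_max, soft_max, overlap
```

For example, `resolve_size_limits(max_tokens=512)` yields a hard limit of 2048 characters, while `resolve_size_limits(max_characters=500, new_after_n_chars=800)` clamps the soft limit down to 500.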