Heuristic:Unstructured IO Unstructured Chunk Size Tuning
| Knowledge Sources | |
|---|---|
| Domains | Chunking, Text Splitting, Configuration Management |
| Last Updated | 2026-02-12 09:00 GMT |
Overview
Chunk size parameters have complex interdependencies and defaults that vary by chunking strategy; misconfiguration leads to silent over-chunking, infinite loops, or dropped HTML table structure.
Description
The chunking subsystem centers on CHUNK_MAX_CHARS_DEFAULT = 500 (base.py:29), the global maximum number of characters per chunk. External code should never apply this default directly; only ChunkingOptions.max_characters should resolve it, so that default resolution happens at exactly one point.
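The single-point-of-default rule can be sketched as a property that resolves None lazily; the names below are illustrative and the real ChunkingOptions API may differ:

```python
CHUNK_MAX_CHARS_DEFAULT = 500


class ChunkingOptions:
    """Sketch: the ONLY place where the chunk-size default is applied."""

    def __init__(self, max_characters=None):
        # Store the raw argument; do not resolve the default here.
        self._max_characters_arg = max_characters

    @property
    def max_characters(self):
        # Default resolution happens exactly once, at this property.
        arg = self._max_characters_arg
        return CHUNK_MAX_CHARS_DEFAULT if arg is None else arg


opts = ChunkingOptions()        # resolves to 500
custom = ChunkingOptions(1000)  # caller's value wins
```

Keeping resolution inside the options object means callers can pass None through freely without each call site re-encoding the constant.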
Key tuning parameters and their interactions:
- combine_text_under_n_chars exists as a remedy for over-chunking caused by mis-identified Title elements. In the by_title strategy, it defaults to max_characters (title.py:151-153), which aggressively combines small sections. In the basic strategy, it defaults to 0 (no combining).
- Setting combine_text_under_n_chars > max_characters raises a ValueError (title.py:174-177) to prevent confusing behavior where combining would exceed the maximum chunk size.
- Overlap must be strictly less than max_characters; otherwise chunk text is never consumed and the splitter loops indefinitely (base.py:327).
- The text separator between combined elements is "\n\n" (base.py:248-255), which adds 2 characters of overhead per join.
- For HTML tables, if max_characters < 50, the HTML splitting logic is abandoned entirely in favor of text-only splitting (base.py:851-852), because very small chunks cannot meaningfully preserve HTML table structure.
- Boundary predicates used during splitting are stateful -- they must not be evaluated with any() short-circuit logic because each predicate may update internal state on every call (base.py:408-412).
- Token counting via tiktoken is lazy-loaded to avoid import-time overhead and missing-dependency errors (base.py:50-69).
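The parameter interactions above can be consolidated into one validation pass. This is a hypothetical helper, not the library's API; it only encodes the documented rules (by_title defaults combine to max_characters, basic defaults it to 0, combine must not exceed max, overlap must be strictly less than max):

```python
def resolve_chunking_params(strategy, max_characters=500,
                            combine_text_under_n_chars=None, overlap=0):
    """Illustrative sketch of how the documented knobs interact."""
    if combine_text_under_n_chars is None:
        # by_title aggressively merges small sections; basic never merges.
        combine_text_under_n_chars = (
            max_characters if strategy == "by_title" else 0
        )
    if combine_text_under_n_chars > max_characters:
        # Mirrors the guard at title.py:174-177.
        raise ValueError(
            f"combine_text_under_n_chars ({combine_text_under_n_chars})"
            f" must not exceed max_characters ({max_characters})"
        )
    if overlap >= max_characters:
        # Mirrors the invariant at base.py:327: equal-or-larger overlap
        # means chunk text is never consumed and splitting never ends.
        raise ValueError(
            f"overlap ({overlap}) must be strictly less than"
            f" max_characters ({max_characters})"
        )
    return max_characters, combine_text_under_n_chars, overlap
```

For example, resolve_chunking_params("by_title") yields a combine threshold of 500, while resolve_chunking_params("basic") yields 0.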
Usage
Apply this heuristic when:
- Configuring chunk sizes for downstream embedding models or LLM context windows.
- Debugging unexpected chunk sizes, especially very small chunks from the by_title strategy.
- Working with HTML tables that must preserve structural markup in chunks.
- Implementing custom boundary predicates for chunk splitting.
The Insight (Rule of Thumb)
- Action: Always configure chunking through ChunkingOptions rather than applying defaults externally. Set combine_text_under_n_chars explicitly when using by_title to control small-section merging. Keep max_characters >= 50 when processing HTML tables.
- Value: CHUNK_MAX_CHARS_DEFAULT = 500; by_title defaults combine_text_under_n_chars to max_characters; basic defaults it to 0; text separator is "\n\n" (2 chars overhead); HTML table splitting requires max_characters >= 50.
- Trade-off: Larger max_characters produces fewer, more contextual chunks but may exceed downstream token limits. Smaller values fragment tables and lose HTML structure. Aggressive combining (high combine_text_under_n_chars) merges semantically distinct sections.
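When sizing max_characters against a downstream limit, remember that joining n elements with the "\n\n" separator adds 2 * (n - 1) characters of overhead. A quick arithmetic check (the helper name is ours, not the library's):

```python
SEPARATOR = "\n\n"  # the text separator used between combined elements


def combined_length(element_texts):
    """Character count of the given texts once joined with the separator."""
    return (sum(len(t) for t in element_texts)
            + len(SEPARATOR) * (len(element_texts) - 1))


# Two 90-character sections combine to 90 + 90 + 2 = 182 characters,
# which still fits under the 500-character default.
texts = ["Section A" * 10, "Section B" * 10]
combined_length(texts)  # 182
```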
Reasoning
Chunk size tuning is critical because downstream consumers (embedding models, LLMs) have fixed context windows. The library provides multiple knobs that interact in non-obvious ways. The by_title vs basic difference in combine_text_under_n_chars defaults is particularly surprising: by_title aggressively merges small sections (defaulting to max_characters), while basic does not merge at all (defaulting to 0). The ValueError guard on combine > max prevents a configuration that would produce chunks exceeding the stated maximum, which would be confusing and break downstream constraints. The stateful boundary predicate requirement is a subtle correctness concern -- using Python's any() would short-circuit evaluation and skip state updates in later predicates, producing incorrect split points.
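The short-circuit pitfall is easy to reproduce with a toy stateful predicate (illustrative class, not from the library): passing a generator to any() stops at the first True, so later predicates miss their state updates, while materializing the results first evaluates every predicate.

```python
class CountingPredicate:
    """Toy boundary predicate whose correctness depends on seeing every call."""

    def __init__(self, fires_at):
        self.calls = 0
        self.fires_at = fires_at

    def __call__(self, element):
        self.calls += 1  # state update must happen on EVERY element
        return self.calls == self.fires_at


# Wrong: any() over a generator short-circuits after p1 fires,
# so p2 never observes the element and its count is skewed.
p1, p2 = CountingPredicate(1), CountingPredicate(2)
any(p(None) for p in (p1, p2))
p2.calls  # 0 -- the state update was skipped

# Right: evaluate all predicates first, then reduce the results.
q1, q2 = CountingPredicate(1), CountingPredicate(2)
results = [q(None) for q in (q1, q2)]
any(results)
q2.calls  # 1 -- state updated even though q1 already fired
```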
Code Evidence
Default chunk size and the single-point-of-default rule (base.py:29):
# base.py:29
CHUNK_MAX_CHARS_DEFAULT = 500
# NOTE: External code should NOT apply this default themselves.
# Only ChunkingOptions.max_characters should apply it.
combine_text_under_n_chars defaults differ by strategy (title.py:151-153, 174-177):
# title.py:151-153 - by_title default
combine_text_under_n_chars = (
    max_characters if combine_text_under_n_chars is None else combine_text_under_n_chars
)
# title.py:174-177 - guard against misconfiguration
if combine_text_under_n_chars > max_characters:
    raise ValueError(
        f"combine_text_under_n_chars ({combine_text_under_n_chars}) must not exceed"
        f" max_characters ({max_characters})"
    )
Stateful boundary predicates must not short-circuit (base.py:408-412):
# base.py:408-412 - boundary predicates are STATEFUL
# -- Do NOT use any() here because predicates update internal state
# -- on each call. Short-circuiting would skip state updates.
results = [pred(element) for pred in boundary_predicates]
is_boundary = any(results)
HTML table splitting abandoned below 50 chars (base.py:851-852):
# base.py:851-852
if max_characters < 50:
    # HTML splitting abandoned; fall back to text-only
    return self._split_text_only(text, max_characters)
Overlap must be less than max_characters (base.py:327):
# base.py:327
# overlap >= max_characters means chunk text is never consumed
assert overlap < max_characters, (
    f"overlap ({overlap}) must be less than max_characters ({max_characters})"
)
Related Pages
- Implementation:Unstructured_IO_Unstructured_ChunkingOptions
- Implementation:Unstructured_IO_Unstructured_Chunk_Elements
- Implementation:Unstructured_IO_Unstructured_Chunk_By_Title
- Principle:Unstructured_IO_Unstructured_Chunk_Size_Configuration
- Principle:Unstructured_IO_Unstructured_Basic_Chunking
- Principle:Unstructured_IO_Unstructured_Section_Aware_Chunking