
Heuristic:Unstructured IO Unstructured Chunk Size Tuning

From Leeroopedia
Knowledge Sources
Domains Chunking, Text Splitting, Configuration Management
Last Updated 2026-02-12 09:00 GMT

Overview

Chunk size parameters have complex interdependencies and defaults that vary by chunking strategy; misconfiguration leads to silent over-chunking, infinite loops, or dropped HTML table structure.

Description

The chunking subsystem centers on CHUNK_MAX_CHARS_DEFAULT = 500 (base.py:29) as the global maximum characters per chunk. However, this default should never be applied by external code directly -- only ChunkingOptions.max_characters should apply it, ensuring a single point of default resolution.

Key tuning parameters and their interactions:

  • combine_text_under_n_chars exists as a remedy for over-chunking caused by mis-identified Title elements. In the by_title strategy, it defaults to max_characters (title.py:151-153), which aggressively combines small sections. In the basic strategy, it defaults to 0 (no combining).
  • Setting combine_text_under_n_chars > max_characters raises a ValueError (title.py:174-177) to prevent confusing behavior where combining would exceed the maximum chunk size.
  • Overlap must be strictly less than max_characters, otherwise chunk text is never consumed and the splitter loops indefinitely (base.py:327).
  • The text separator between combined elements is "\n\n" (base.py:248-255), which adds 2 characters of overhead per join.
  • For HTML tables, if max_characters < 50, the HTML splitting logic is abandoned entirely in favor of text-only splitting (base.py:851-852), because very small chunks cannot meaningfully preserve HTML table structure.
  • Boundary predicates used during splitting are stateful -- they must not be evaluated with any() short-circuit logic because each predicate may update internal state on every call (base.py:408-412).
  • Token counting via tiktoken is lazy-loaded to avoid import-time overhead and missing-dependency errors (base.py:50-69).
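The defaults and guards above can be modeled as a single resolution function. The following is an illustrative, stdlib-only sketch of the rules as this page describes them, not the library's actual ChunkingOptions implementation; the function and parameter names mirror the page's terminology.

```python
# Sketch of default resolution and guards; illustrative, not library code.

CHUNK_MAX_CHARS_DEFAULT = 500  # mirrors base.py:29

def resolve_chunking_options(
    max_characters=None,
    combine_text_under_n_chars=None,
    overlap=0,
    strategy="by_title",
):
    """Resolve chunking defaults at a single point, as the heuristic advises."""
    if max_characters is None:
        max_characters = CHUNK_MAX_CHARS_DEFAULT
    if combine_text_under_n_chars is None:
        # by_title combines small sections aggressively; basic does not combine.
        combine_text_under_n_chars = max_characters if strategy == "by_title" else 0
    if combine_text_under_n_chars > max_characters:
        # Mirrors the guard at title.py:174-177.
        raise ValueError(
            f"combine_text_under_n_chars ({combine_text_under_n_chars}) must not"
            f" exceed max_characters ({max_characters})"
        )
    if overlap >= max_characters:
        # Mirrors base.py:327 -- otherwise chunk text is never consumed.
        raise ValueError(
            f"overlap ({overlap}) must be strictly less than max_characters"
            f" ({max_characters})"
        )
    return max_characters, combine_text_under_n_chars, overlap
```

Resolving defaults in one place means callers never need to know CHUNK_MAX_CHARS_DEFAULT, which is exactly the single-point-of-default rule from base.py:29.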

Usage

Apply this heuristic when:

  • Configuring chunk sizes for downstream embedding models or LLM context windows.
  • Debugging unexpected chunk sizes, especially very small chunks from the by_title strategy.
  • Working with HTML tables that must preserve structural markup in chunks.
  • Implementing custom boundary predicates for chunk splitting.

The Insight (Rule of Thumb)

  • Action: Always configure chunking through ChunkingOptions rather than applying defaults externally. Set combine_text_under_n_chars explicitly when using by_title to control small-section merging. Keep max_characters >= 50 when processing HTML tables.
  • Value: CHUNK_MAX_CHARS_DEFAULT = 500; by_title defaults combine_text_under_n_chars to max_characters; basic defaults it to 0; text separator is "\n\n" (2 chars overhead); HTML table splitting requires max_characters >= 50.
  • Trade-off: Larger max_characters produces fewer, more contextual chunks but may exceed downstream token limits. Smaller values fragment tables and lose HTML structure. Aggressive combining (high combine_text_under_n_chars) merges semantically distinct sections.
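The 2-character separator overhead noted in the Value bullet matters when deciding whether small sections can be combined. A minimal sketch, assuming only the "\n\n" separator documented above (the helper names are illustrative, not library API):

```python
# Illustrative helpers for the "\n\n" join overhead (base.py:248-255).

TEXT_SEPARATOR = "\n\n"  # 2 characters of overhead per join

def combined_length(texts, separator=TEXT_SEPARATOR):
    """Length of the element texts once joined with the chunking separator."""
    return sum(len(t) for t in texts) + len(separator) * max(len(texts) - 1, 0)

def fits_when_combined(texts, max_characters=500):
    """Whether combining these element texts stays within max_characters."""
    return combined_length(texts) <= max_characters
```

Two 249-character sections combine to exactly 500 characters (249 + 249 + 2), so they fit under the default maximum; make either one character longer and the combination no longer fits.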

Reasoning

Chunk size tuning is critical because downstream consumers (embedding models, LLMs) have fixed context windows. The library provides multiple knobs that interact in non-obvious ways. The by_title vs basic difference in combine_text_under_n_chars defaults is particularly surprising: by_title aggressively merges small sections (defaulting to max_characters), while basic does not merge at all (defaulting to 0). The ValueError guard on combine > max prevents a configuration that would produce chunks exceeding the stated maximum, which would be confusing and break downstream constraints. The stateful boundary predicate requirement is a subtle correctness concern -- using Python's any() would short-circuit evaluation and skip state updates in later predicates, producing incorrect split points.
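The short-circuit hazard is easy to reproduce with a toy stateful predicate. The sketch below is illustrative (these classes are not part of the library): a predicate that fires every n-th call must observe every element, so skipping a call via `any()` short-circuiting silently corrupts its count.

```python
# Toy demonstration of why stateful boundary predicates must all be called.

class EveryN:
    """Stateful boundary predicate: fires on every n-th element it sees."""

    def __init__(self, n):
        self.n = n
        self.count = 0

    def __call__(self, element):
        self.count += 1
        return self.count % self.n == 0

def boundaries_correct(elements, predicates):
    # Evaluate ALL predicates (list comprehension), then combine results.
    return [any([pred(el) for pred in predicates]) for el in elements]

def boundaries_buggy(elements, predicates):
    # Generator inside any() short-circuits: once one predicate returns
    # True, later predicates are skipped and their state falls behind.
    return [any(pred(el) for pred in predicates) for el in elements]
```

With predicates `[EveryN(2), EveryN(3)]` over six elements, the correct version reports boundaries at elements 2, 3, 4, and 6, while the short-circuiting version skips the three-element predicate whenever the two-element predicate fires, shifting its boundaries to the wrong positions.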

Code Evidence

Default chunk size and the single-point-of-default rule (base.py:29):

# base.py:29
CHUNK_MAX_CHARS_DEFAULT = 500
# NOTE: External code should NOT apply this default themselves.
# Only ChunkingOptions.max_characters should apply it.

combine_text_under_n_chars defaults differ by strategy (title.py:151-153, 174-177):

# title.py:151-153 - by_title default
combine_text_under_n_chars = (
    max_characters if combine_text_under_n_chars is None else combine_text_under_n_chars
)

# title.py:174-177 - guard against misconfiguration
if combine_text_under_n_chars > max_characters:
    raise ValueError(
        f"combine_text_under_n_chars ({combine_text_under_n_chars}) must not exceed"
        f" max_characters ({max_characters})"
    )

Stateful boundary predicates must not short-circuit (base.py:408-412):

# base.py:408-412 - boundary predicates are STATEFUL
# -- Do NOT use any() here because predicates update internal state
# -- on each call. Short-circuiting would skip state updates.
results = [pred(element) for pred in boundary_predicates]
is_boundary = any(results)

HTML table splitting abandoned below 50 chars (base.py:851-852):

# base.py:851-852
if max_characters < 50:
    # HTML splitting abandoned; fall back to text-only
    return self._split_text_only(text, max_characters)

Overlap must be less than max_characters (base.py:327):

# base.py:327
# overlap >= max_characters means chunk text is never consumed
assert overlap < max_characters, (
    f"overlap ({overlap}) must be less than max_characters ({max_characters})"
)
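A toy splitter makes the overlap guard concrete: each chunk consumes max_characters of text and then backs up by overlap, so the cursor advances by max_characters - overlap per step. If overlap >= max_characters, the step is zero or negative and the loop never terminates. This sketch is illustrative, not the library's splitter:

```python
# Illustrative fixed-size splitter showing why overlap < max_characters.

def split_with_overlap(text, max_characters, overlap):
    """Split text into chunks of max_characters, overlapping by `overlap`."""
    # Mirrors the guard at base.py:327: a step of zero never consumes text.
    assert overlap < max_characters, "overlap must be less than max_characters"
    step = max_characters - overlap  # forward progress per chunk
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        start += step
    return chunks
```

For example, splitting a 10-character string with max_characters=4 and overlap=2 advances 2 characters per chunk, producing five overlapping chunks; with overlap=4 the assertion fires instead of looping forever.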
