

Principle:Unstructured IO Unstructured Chunk Size Configuration

From Leeroopedia
Knowledge Sources
Domains Document_Processing, RAG, Configuration
Last Updated 2026-02-12 00:00 GMT

Overview

A configuration mechanism that validates and resolves chunking size parameters (character limits, token limits, overlap) into a consistent set of constraints for the chunking engine.

Description

Chunk size configuration manages the interacting parameters that constrain chunk size. Users can specify sizes in characters or tokens, set hard and soft limits, configure overlap, and choose a tokenizer. These parameters interact: token-based limits must be converted to character estimates, soft limits cannot exceed hard limits, and overlap cannot exceed chunk size.

The configuration layer validates all parameters, resolves defaults, computes derived values (e.g., converting token limits to approximate character limits), and provides a clean interface for the chunking algorithms to query size constraints.
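As a sketch of such a configuration layer (the class name and defaults here are illustrative assumptions, not the library's actual API), validation and default resolution might look like:

```python
# Hypothetical sketch of a chunk-size configuration layer.
# ChunkingConfig and its fields are illustrative, not a real library class.
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkingConfig:
    max_characters: int = 500       # hard ceiling on chunk length
    new_after_n_chars: int = None   # soft limit; defaults to the hard max
    overlap: int = 0                # characters carried into the next chunk

    def __post_init__(self):
        # Resolve the soft-limit default and clamp it to the hard max.
        soft = self.new_after_n_chars
        if soft is None or soft > self.max_characters:
            object.__setattr__(self, "new_after_n_chars", self.max_characters)
        # Overlap equal to or larger than the hard max could never terminate.
        if self.overlap >= self.max_characters:
            raise ValueError("overlap must be less than max_characters")
```

The chunking algorithms then query a single validated object rather than re-checking raw parameters at every decision point.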

Usage

Use this principle when you need to understand how chunk size parameters interact and are validated. It is the configuration layer that sits between user-facing chunking functions (chunk_elements, chunk_by_title) and the core chunking engine. Understanding this layer is essential for tuning chunk sizes for specific embedding models or retrieval requirements.
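For instance, tuning for an embedding model with a 512-token input window might proceed as follows (the 4 characters-per-token ratio is a common heuristic for English text, not an exact figure):

```python
# Rough character budget for a 512-token embedding model.
# avg_chars_per_token = 4 is a common English-text heuristic, not exact.
token_window = 512
avg_chars_per_token = 4
max_characters = token_window * avg_chars_per_token  # 2048
new_after_n_chars = int(max_characters * 0.8)        # softer target: 1638
```

Leaving headroom between the soft and hard limits lets chunks end at natural element boundaries instead of mid-text splits.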

Theoretical Basis

Chunk size configuration resolves a parameter hierarchy:

Hard max (max_characters / max_tokens): The absolute ceiling. No chunk will exceed this size. Elements that are individually larger than hard_max are split mid-text at word boundaries.

Soft max (new_after_n_chars / new_after_n_tokens): The target size. Once a chunk reaches this size, the next element boundary triggers a new chunk. Must be ≤ hard_max.

Overlap: Characters from the end of one chunk are prepended to the beginning of the next. Must be < hard_max.

Token conversion: When token-based limits are specified, they are converted to character estimates using the specified tokenizer's average characters-per-token ratio.
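The characters-per-token ratio can be estimated empirically from sample text. This sketch uses naive whitespace tokenization as a stand-in; a real deployment would measure against the configured tokenizer (e.g. a BPE vocabulary):

```python
# Estimate average characters per token from a text sample,
# using whitespace splitting as a stand-in for the real tokenizer.
def avg_chars_per_token(sample: str, tokenize=str.split) -> float:
    tokens = tokenize(sample)
    if not tokens:
        raise ValueError("sample produced no tokens")
    return len(sample) / len(tokens)

ratio = avg_chars_per_token("Chunking splits documents into sized pieces")
hard_max = int(512 * ratio)  # character estimate for a 512-token limit
```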

# Configuration resolution (simplified)
def resolve_size_limits(max_characters=None, max_tokens=None,
                        soft_max=None, overlap=0,
                        avg_chars_per_token=4.0):
    if max_tokens is not None:
        hard_max = int(max_tokens * avg_chars_per_token)
    elif max_characters is not None:
        hard_max = max_characters
    else:
        hard_max = 500  # default

    if soft_max is None or soft_max > hard_max:
        soft_max = hard_max
    if overlap >= hard_max:
        raise ValueError("overlap must be less than max_characters")
    return hard_max, soft_max, overlap
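The overlap rule above can be illustrated concretely. This sketch assumes overlap is a plain suffix copy, which matches the description here but simplifies real boundary handling (e.g. trimming to word boundaries):

```python
# Prepend the tail of the previous chunk to the start of the next.
def apply_overlap(prev_chunk: str, next_text: str, overlap: int) -> str:
    if overlap <= 0:
        return next_text
    return prev_chunk[-overlap:] + next_text

chunk2 = apply_overlap("retrieval depends on context.", "The next section", 8)
# chunk2 == "context.The next section"
```

Because the overlapped prefix counts toward the next chunk's length, an overlap at or above hard_max would leave no room for new text, which is why the configuration rejects it.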

Related Pages

Implemented By

Uses Heuristic
