Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Heuristic:Langchain ai Langchain Text Splitter Separator Hierarchy

From Leeroopedia
Knowledge Sources
Domains NLP, Document_Processing
Last Updated 2026-02-11 14:00 GMT

Overview

The `RecursiveCharacterTextSplitter` uses a hierarchical separator strategy (`\n\n` -> `\n` -> ` ` -> `""`) to preserve document structure while respecting chunk size limits.

Description

When splitting text into chunks, the order of separators matters. The `RecursiveCharacterTextSplitter` tries separators from most structural (paragraph break) to least structural (character-level), choosing the first separator that produces chunks within the configured size. The empty string `""` as the final separator guarantees all text is processed, even if no natural break points exist within the chunk size.

The default chunk size of 4000 characters with 200-character overlap provides a reasonable starting point for most LLM context windows. However, these values should be tuned based on the specific model's token-to-character ratio and the nature of the documents.

Usage

Apply this heuristic when configuring text splitting for RAG pipelines or document indexing. The default separators work well for plain text and markdown. Override the separator list for structured formats (e.g., code, HTML, JSON).

The Insight (Rule of Thumb)

  • Action: Use the default separator hierarchy `["\n\n", "\n", " ", ""]` for general text. Override for specific formats.
  • Value: `chunk_size=4000`, `chunk_overlap=200` as sensible defaults.
  • Trade-off: Smaller chunks improve retrieval precision but increase storage and embedding costs. Larger chunks preserve more context but may dilute relevance scores.
  • Validation: `chunk_overlap` must be strictly less than `chunk_size`. Both must be positive.

Reasoning

The separator hierarchy preserves document semantics:

  1. `"\n\n"` (paragraph): Splits at paragraph boundaries — preserves complete thoughts.
  2. `"\n"` (line): Falls back to line breaks — preserves sentence-level structure.
  3. `" "` (space): Falls back to word boundaries — avoids splitting mid-word.
  4. `""` (character): Last resort — splits at every character to guarantee chunk size compliance.

When a chunk exceeds the configured size despite splitting, a warning is logged (not an error), because some content blocks (e.g., long URLs, code blocks) cannot be split at natural boundaries.

Code evidence from `libs/text-splitters/langchain_text_splitters/character.py:104`:

self._separators = separators or ["\n\n", "\n", " ", ""]

Chunk size validation from `libs/text-splitters/langchain_text_splitters/base.py:73-84`:

if chunk_size <= 0:
    msg = f"chunk_size must be > 0, got {chunk_size}"
    raise ValueError(msg)
if chunk_overlap < 0:
    msg = f"chunk_overlap must be >= 0, got {chunk_overlap}"
    raise ValueError(msg)
if chunk_overlap > chunk_size:
    msg = (
        f"Got a larger chunk overlap ({chunk_overlap}) than chunk size "
        f"({chunk_size}), should be smaller."
    )
    raise ValueError(msg)

Overflow warning from `libs/text-splitters/langchain_text_splitters/base.py:166-172`:

if total > self._chunk_size:
    logger.warning(
        "Created a chunk of size %d, which is longer than the "
        "specified %d",
        total,
        self._chunk_size,
    )

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment