Heuristic:Langchain ai Langchain Text Splitter Separator Hierarchy
| Knowledge Sources | |
|---|---|
| Domains | NLP, Document_Processing |
| Last Updated | 2026-02-11 14:00 GMT |
Overview
The `RecursiveCharacterTextSplitter` uses a hierarchical separator strategy (`\n\n` -> `\n` -> ` ` -> `""`) to preserve document structure while respecting chunk size limits.
Description
When splitting text into chunks, the order of separators matters. The `RecursiveCharacterTextSplitter` tries separators from most structural (paragraph break) to least structural (character-level). It splits on the first separator in the list that occurs in the text, then recursively applies the remaining separators to any piece that still exceeds the configured size. The empty string `""` as the final separator guarantees all text is processed, even if no natural break points exist within the chunk size.
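The fallback order can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name `recursive_split` is hypothetical, and the sketch omits overlap and the step that merges small pieces back into larger chunks.

```python
DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]


def recursive_split(text, separators, chunk_size):
    """Split text into pieces of at most chunk_size characters.

    Hypothetical helper: no overlap, no re-merging of small pieces.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")
    if len(text) <= chunk_size:
        return [text] if text else []

    # Use the first separator in the list that occurs in the text.
    # The empty string always matches and means "cut anywhere".
    for index, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    else:
        # No separator applies: the oversized piece is returned as-is.
        return [text]

    remaining = separators[index + 1:]
    if sep == "":
        # Stand-in for a per-character split followed by re-merging.
        pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    else:
        pieces = text.split(sep)

    chunks = []
    for piece in pieces:
        if not piece:
            continue
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Still too long: fall back to the finer-grained separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    return chunks


text = "First paragraph.\n\nSecond paragraph is longer than the limit."
chunks = recursive_split(text, DEFAULT_SEPARATORS, chunk_size=20)
```

Here the first paragraph fits and stays whole, while the second is too long for 20 characters and falls through to the space separator.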
The default chunk size of 4000 characters with 200-character overlap provides a reasonable starting point for most LLM context windows. However, these values should be tuned based on the specific model's token-to-character ratio and the nature of the documents.
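As a rough aid for that tuning, the sketch below converts a token budget into character-based settings. The helper name `chars_for_token_budget`, the figure of about 4 characters per token, and the 5% overlap fraction are all assumptions chosen for illustration; the real ratio depends on the tokenizer and the language of the text.

```python
def chars_for_token_budget(max_tokens, chars_per_token=4.0, overlap_fraction=0.05):
    """Estimate (chunk_size, chunk_overlap) in characters for a token budget.

    chars_per_token and overlap_fraction are assumed values, not library defaults.
    """
    if max_tokens <= 0 or chars_per_token <= 0:
        raise ValueError("max_tokens and chars_per_token must be positive")
    if not 0 <= overlap_fraction < 1:
        raise ValueError("overlap_fraction must be in [0, 1)")
    # Round the size down so the estimate stays within the token budget.
    chunk_size = int(max_tokens * chars_per_token)
    chunk_overlap = round(chunk_size * overlap_fraction)
    return chunk_size, chunk_overlap


print(chars_for_token_budget(1000))  # (4000, 200)
```

Under these assumptions, a budget of about 1000 tokens per chunk maps back onto the 4000/200 defaults.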
Usage
Apply this heuristic when configuring text splitting for RAG pipelines or document indexing. The default separators work well for plain text and markdown. Override the separator list for structured formats (e.g., code, HTML, JSON).
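To see why an override helps for structured formats, the sketch below compares which separator is tried first on a small Python snippet. `PYTHON_SEPARATORS` is an illustrative list written for this example, not necessarily the one the library ships, and `first_matching_separator` is a hypothetical helper.

```python
DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]

# Illustrative override: prefer class and function boundaries over blank lines.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]


def first_matching_separator(text, separators):
    """Return the first separator in the list that occurs in the text."""
    for sep in separators:
        if sep == "" or sep in text:
            return sep
    return None


source = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"

default_choice = first_matching_separator(source, DEFAULT_SEPARATORS)
python_choice = first_matching_separator(source, PYTHON_SEPARATORS)
```

With the default list the first cut falls on blank lines, which can also occur inside a function body; with the override it falls on `def` boundaries, so each function is more likely to stay in one chunk.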
The Insight (Rule of Thumb)
- Action: Use the default separator hierarchy `["\n\n", "\n", " ", ""]` for general text. Override for specific formats.
- Value: `chunk_size=4000`, `chunk_overlap=200` as sensible defaults.
- Trade-off: Smaller chunks improve retrieval precision but increase storage and embedding costs. Larger chunks preserve more context but may dilute relevance scores.
- Validation: `chunk_size` must be greater than 0, and `chunk_overlap` must be at least 0 and no larger than `chunk_size`. Violating any of these raises a `ValueError`.
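To put rough numbers on the storage side of the trade-off, the sketch below estimates how many chunks a document produces. It assumes an idealized sliding window of fixed width, which real splitting only approximates because chunks end at separators; `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(text_length, chunk_size, chunk_overlap):
    """Estimate the chunk count for a fixed-width sliding window."""
    step = chunk_size - chunk_overlap
    # The estimate needs a positive step, so the overlap must be smaller than the size.
    if chunk_size <= 0 or chunk_overlap < 0 or step <= 0:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if text_length <= chunk_size:
        return 1 if text_length > 0 else 0
    # One full window, then one more for every `step` characters that remain.
    return 1 + math.ceil((text_length - chunk_size) / step)


print(estimate_chunk_count(100_000, 4000, 200))  # 27
print(estimate_chunk_count(100_000, 1000, 200))  # 125
```

For a 100,000-character document, shrinking the chunk size from 4000 to 1000 at the same overlap raises the estimate from 27 to 125 chunks, each of which must be embedded and stored.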
Reasoning
The separator hierarchy preserves document semantics:
- `"\n\n"` (paragraph): Splits at paragraph boundaries — preserves complete thoughts.
- `"\n"` (line): Falls back to line breaks — preserves sentence-level structure.
- `" "` (space): Falls back to word boundaries — avoids splitting mid-word.
- `""` (character): Last resort — splits at every character to guarantee chunk size compliance.
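When a chunk still exceeds the configured size after splitting, the splitter logs a warning rather than raising an error, because some content (e.g., long URLs or code blocks) has no natural boundary to split on. This is most likely with a custom separator list that omits the final `""` fallback.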
Code evidence from `libs/text-splitters/langchain_text_splitters/character.py:104`:
    self._separators = separators or ["\n\n", "\n", " ", ""]
Chunk size validation from `libs/text-splitters/langchain_text_splitters/base.py:73-84`:
    if chunk_size <= 0:
        msg = f"chunk_size must be > 0, got {chunk_size}"
        raise ValueError(msg)
    if chunk_overlap < 0:
        msg = f"chunk_overlap must be >= 0, got {chunk_overlap}"
        raise ValueError(msg)
    if chunk_overlap > chunk_size:
        msg = (
            f"Got a larger chunk overlap ({chunk_overlap}) than chunk size "
            f"({chunk_size}), should be smaller."
        )
        raise ValueError(msg)
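The boundary cases of these checks are easy to misread, so the sketch below wraps the same three conditions in a standalone function and probes the edges. The names `validate_chunk_params` and `is_valid` are hypothetical and exist only for this illustration.

```python
def validate_chunk_params(chunk_size, chunk_overlap):
    """Apply the same three checks as the excerpt above (hypothetical wrapper)."""
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be > 0, got {chunk_size}")
    if chunk_overlap < 0:
        raise ValueError(f"chunk_overlap must be >= 0, got {chunk_overlap}")
    if chunk_overlap > chunk_size:
        raise ValueError(
            f"Got a larger chunk overlap ({chunk_overlap}) than chunk size "
            f"({chunk_size}), should be smaller."
        )


def is_valid(chunk_size, chunk_overlap):
    """Return True when the parameters pass all three checks."""
    try:
        validate_chunk_params(chunk_size, chunk_overlap)
    except ValueError:
        return False
    return True


print(is_valid(4000, 200))  # True
print(is_valid(100, 0))     # True: zero overlap is accepted
print(is_valid(100, 100))   # True: overlap equal to the size is accepted
print(is_valid(100, 101))   # False
```

The comparison is `>`, so an overlap equal to the chunk size passes these checks, and an overlap of zero is also allowed.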
Overflow warning from `libs/text-splitters/langchain_text_splitters/base.py:166-172`:
    if total > self._chunk_size:
        logger.warning(
            "Created a chunk of size %d, which is longer than the "
            "specified %d",
            total,
            self._chunk_size,
        )
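A minimal sketch of when such a warning can fire: the hypothetical `merge_splits` below greedily packs pieces into chunks and warns when a single piece is already longer than the limit. It is a simplification written for illustration (no overlap, its own logger name), not the library's merge routine.

```python
import logging

logger = logging.getLogger("splitter_sketch")  # hypothetical logger name


def merge_splits(splits, separator, chunk_size):
    """Greedily pack pieces into chunks of at most chunk_size characters."""
    chunks, current, total = [], [], 0
    for piece in splits:
        extra = len(separator) if current else 0
        if current and total + extra + len(piece) > chunk_size:
            # The piece does not fit: close the current chunk first.
            chunks.append(separator.join(current))
            current, total, extra = [], 0, 0
        if len(piece) > chunk_size:
            # Nothing left to split on, so warn and keep the oversized piece.
            logger.warning(
                "Created a chunk of size %d, which is longer than the "
                "specified %d",
                len(piece),
                chunk_size,
            )
        current.append(piece)
        total += extra + len(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


url = "https://example.com/a/very/long/path/that/has/no/spaces"
words = ["see", url, "for", "details"]
chunks = merge_splits(words, " ", chunk_size=20)
```

In this run the URL has no spaces, so it ends up as a single oversized chunk and triggers the warning, while the words around it are packed normally.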