Heuristic: Deepset AI Haystack Document Splitting Defaults
| Knowledge Sources | Details |
|---|---|
| Domains | Preprocessing, Optimization |
| Last Updated | 2026-02-11 20:00 GMT |
Overview
Document splitting defaults to 200-word chunks with no overlap; enable `respect_sentence_boundary=True` for semantic coherence, and set `skip_empty_documents=False` when downstream LLM extractors process non-text content.
Description
The DocumentSplitter component breaks documents into smaller chunks for embedding and retrieval. Its default configuration reflects empirical best practices: `split_length=200` words provides chunks large enough for semantic meaning but small enough for efficient embedding. The `split_overlap=0` default avoids redundancy, while `split_threshold=0` means no minimum chunk size (all chunks are kept). The `respect_sentence_boundary` option uses NLTK to ensure word-based splits never cut mid-sentence, preserving context at chunk boundaries.
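The default behavior described above can be emulated in a few lines. This is a minimal sketch of word-based splitting with `split_length=200` and `split_overlap=0`, not the library's implementation: the real DocumentSplitter also carries document metadata and supports other `split_by` units.

```python
# Minimal emulation of word-based splitting with Haystack's defaults
# (split_length=200, split_overlap=0). Illustration only.

def split_by_word(text: str, split_length: int = 200, split_overlap: int = 0) -> list[str]:
    """Split text into chunks of `split_length` words, sharing `split_overlap` words."""
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + split_length]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + split_length >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(450))
chunks = split_by_word(doc)              # defaults: 200 words, no overlap
print([len(c.split()) for c in chunks])  # → [200, 200, 50]
```

Note that with `split_threshold=0` the trailing 50-word remainder is kept as its own chunk rather than merged or dropped.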
Usage
Apply this heuristic when configuring document splitting for RAG pipelines, tuning chunk sizes for retrieval quality, or processing documents that contain non-text content (images, tables). The defaults work well for general-purpose English text; adjust for specific domains or languages.
The Insight (Rule of Thumb)
- Action: Use `split_by="word"` with `split_length=200` as a starting point.
- Value: Default values: `split_length=200`, `split_overlap=0`, `split_threshold=0`.
- Trade-off: Larger chunks retain more context but reduce retrieval precision; smaller chunks improve precision but may lose context.
- Sentence boundaries: Set `respect_sentence_boundary=True` when semantic coherence matters more than uniform chunk sizes.
- Empty documents: Set `skip_empty_documents=False` when using `LLMDocumentContentExtractor` to process non-textual documents.
- Overlap tracking: When using overlap, chunks store `_split_overlap` metadata tracking which other documents share content.
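The overlap-tracking bullet can be sketched as follows. The `_split_overlap` field name comes from the source above, but the dict shape and range arithmetic here are a simplified emulation, not Haystack's exact metadata format.

```python
# Simplified emulation of overlap bookkeeping: each chunk records which
# neighbouring chunk it shares words with, analogous to the
# `_split_overlap` metadata DocumentSplitter writes.

def split_with_overlap(words: list[str], split_length: int, split_overlap: int) -> list[dict]:
    step = split_length - split_overlap
    chunks = []
    for idx, start in enumerate(range(0, len(words), step)):
        chunk_words = words[start:start + split_length]
        meta = {"_split_overlap": []}
        if split_overlap and idx > 0:
            # This chunk's first `split_overlap` words repeat the previous chunk's tail.
            meta["_split_overlap"].append({"doc_idx": idx - 1, "shared_words": split_overlap})
        chunks.append({"content": " ".join(chunk_words), "meta": meta})
        if start + split_length >= len(words):
            break
    return chunks

chunks = split_with_overlap([f"w{i}" for i in range(10)], split_length=4, split_overlap=1)
print(len(chunks))                          # → 3
print(chunks[1]["meta"]["_split_overlap"])  # → [{'doc_idx': 0, 'shared_words': 1}]
```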
Reasoning
The 200-word default balances several competing concerns:
- Embedding quality: Most embedding models have a context window of 256-512 tokens (~200-400 words). Chunks exceeding this are silently truncated by the model, losing information.
- Retrieval precision: Smaller chunks mean more specific matches, but overly small chunks lose context needed for answer generation.
- LLM context usage: In RAG pipelines, retrieved chunks are concatenated into the prompt. Chunks of ~200 words allow fitting 5-10 relevant passages in a typical 4K-8K context window.
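A quick back-of-envelope check of the context-budget point: the ~1.3 tokens-per-word ratio below is an assumption for English text, not a value from the Haystack source.

```python
# Rough arithmetic behind "5-10 passages in a 4K-8K context window".
TOKENS_PER_WORD = 1.3                       # assumed average for English text
chunk_tokens = int(200 * TOKENS_PER_WORD)   # ≈ 260 tokens per 200-word chunk
passages = 10
print(passages * chunk_tokens)              # → 2600 tokens for 10 passages
# 2600 tokens fits in a 4K window with room left for the prompt and the answer.
```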
The `respect_sentence_boundary=True` option adds a dependency on NLTK but prevents awkward splits mid-sentence that confuse both embedders and LLMs.
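Conceptually, `respect_sentence_boundary` means a chunk ends early rather than cutting a sentence. Haystack delegates sentence detection to NLTK; the sketch below uses a naive regex tokenizer instead, so it is an illustration of the idea, not the library's algorithm.

```python
# Sketch of sentence-boundary-respecting splitting: a chunk closes before
# adding a sentence that would push it past the word limit, so no split
# ever lands mid-sentence. Sentence detection here is a naive regex,
# standing in for NLTK.
import re

def split_respecting_sentences(text: str, split_length: int = 200) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n > split_length:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "One two three. Four five six seven. Eight nine."
print(split_respecting_sentences(text, split_length=5))
# → ['One two three.', 'Four five six seven.', 'Eight nine.']
```

The trade-off is visible even in this toy case: chunks come out under the 5-word limit, so sizes are less uniform than with a hard word cut.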
Code evidence from `haystack/components/preprocessors/document_splitter.py:54-66`:
```python
def __init__(
    self,
    split_by: Literal["function", "page", "passage", "period", "word", "line", "sentence"] = "word",
    split_length: int = 200,
    split_overlap: int = 0,
    split_threshold: int = 0,
    splitting_function: Callable[[str], list[str]] | None = None,
    respect_sentence_boundary: bool = False,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True,
    *,
    skip_empty_documents: bool = True,
):
```
Skip empty documents guidance from `document_splitter.py:92-94`:
```
:param skip_empty_documents: Choose whether to skip documents with empty content. Default is True.
    Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
    from non-textual documents.
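The effect of this flag can be sketched with plain dicts standing in for Haystack Document objects (a simplification, not the library's types): with the default `True`, empty documents are dropped before splitting; with `False` they pass through so a downstream extractor can still fill in their content.

```python
# Emulation of skip_empty_documents semantics using plain dicts in place
# of Haystack Document objects.

def filter_docs(docs: list[dict], skip_empty_documents: bool = True) -> list[dict]:
    if skip_empty_documents:
        # Default: drop documents whose content is empty or whitespace-only.
        return [d for d in docs if (d.get("content") or "").strip()]
    # skip_empty_documents=False: keep everything for downstream extraction.
    return list(docs)

docs = [{"content": "some text"}, {"content": ""}, {"content": None}]
print(len(filter_docs(docs)))                              # → 1
print(len(filter_docs(docs, skip_empty_documents=False)))  # → 3
```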