
Heuristic: deepset-ai Haystack Document Splitting Defaults

From Leeroopedia
Knowledge Sources
Domains: Preprocessing, Optimization
Last Updated: 2026-02-11 20:00 GMT

Overview

Document splitting defaults to 200 words with no overlap; enable `respect_sentence_boundary` for semantic coherence, and set `skip_empty_documents=False` when downstream LLM extractors process non-text content.

Description

The DocumentSplitter component breaks documents into smaller chunks for embedding and retrieval. Its default configuration reflects empirical best practices: `split_length=200` words provides chunks large enough for semantic meaning but small enough for efficient embedding. The `split_overlap=0` default avoids redundancy, while `split_threshold=0` means no minimum chunk size (all chunks are kept). The `respect_sentence_boundary` option uses NLTK to ensure word-based splits never cut mid-sentence, preserving context at chunk boundaries.

Usage

Apply this heuristic when configuring document splitting for RAG pipelines, tuning chunk sizes for retrieval quality, or processing documents that contain non-text content (images, tables). The defaults work well for general-purpose English text; adjust for specific domains or languages.

The Insight (Rule of Thumb)

  • Action: Use `split_by="word"` with `split_length=200` as a starting point.
  • Value: Defaults are `split_length=200`, `split_overlap=0`, `split_threshold=0`.
  • Trade-off: Larger chunks retain more context but reduce retrieval precision; smaller chunks improve precision but may lose context.
  • Sentence boundaries: Set `respect_sentence_boundary=True` when semantic coherence matters more than uniform chunk sizes.
  • Empty documents: Set `skip_empty_documents=False` when using `LLMDocumentContentExtractor` to process non-textual documents.
  • Overlap tracking: When using overlap, chunks store `_split_overlap` metadata tracking which other documents share content.
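The word-based splitting and overlap behavior described above can be sketched in plain Python. This is an illustrative simplification, not Haystack's actual implementation; the function name and structure are assumptions for demonstration only:

```python
def split_by_word(text: str, split_length: int = 200, split_overlap: int = 0) -> list[str]:
    """Split text into word-count chunks, with consecutive chunks sharing
    `split_overlap` words (simplified sketch of the heuristic, not Haystack code)."""
    words = text.split()
    step = split_length - split_overlap  # stride between chunk start positions
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + split_length]
        if not chunk:
            break
        chunks.append(" ".join(chunk))
        if start + split_length >= len(words):
            break  # last chunk already covers the tail; avoid a redundant overlap-only chunk
    return chunks
```

With the defaults (`split_length=200`, `split_overlap=0`) the stride equals the chunk length, so chunks tile the document with no shared words; any positive overlap makes each chunk repeat the tail of its predecessor.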

Reasoning

The 200-word default balances several competing concerns:

  1. Embedding quality: Most embedding models have a context window of 256-512 tokens (~200-400 words). Chunks exceeding this are silently truncated by the model, losing information.
  2. Retrieval precision: Smaller chunks mean more specific matches, but overly small chunks lose context needed for answer generation.
  3. LLM context usage: In RAG pipelines, retrieved chunks are concatenated into the prompt. Chunks of ~200 words allow fitting 5-10 relevant passages in a typical 4K-8K context window.
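The context-budget arithmetic in point 3 can be made concrete. The tokens-per-word ratio and the reserved budget below are rough assumptions (both vary by tokenizer and prompt template), not figures from the Haystack source:

```python
TOKENS_PER_WORD = 1.33   # rough average for English text (assumption; tokenizer-dependent)
RESERVED = 2048          # tokens held back for instructions + generated answer (assumption)

chunk_tokens = 200 * TOKENS_PER_WORD  # ~266 tokens per 200-word chunk
for window in (4096, 8192):
    passages = int((window - RESERVED) // chunk_tokens)
    print(f"{window}-token window fits ~{passages} chunks")
```

Under these assumptions a 4K window fits roughly 7 chunks, consistent with the 5-10 passage range above; larger windows fit proportionally more.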

The `respect_sentence_boundary=True` option adds a dependency on NLTK but prevents awkward splits mid-sentence that confuse both embedders and LLMs.
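The packing behavior behind `respect_sentence_boundary=True` can be sketched as follows. The naive regex tokenizer is a stand-in for NLTK's sentence tokenizer, and the function is an illustrative assumption, not Haystack's implementation:

```python
import re

def split_respecting_sentences(text: str, split_length: int = 200) -> list[str]:
    """Pack whole sentences into chunks of at most `split_length` words,
    never cutting mid-sentence (simplified sketch; Haystack uses NLTK)."""
    # Naive sentence boundary: punctuation followed by whitespace (assumption)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > split_length:
            chunks.append(" ".join(current))  # flush before the limit is exceeded
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The trade-off is visible in the sketch: chunk sizes become uneven (a chunk closes early rather than split a sentence), which is why the option defaults to False when uniform chunk sizes matter more than coherence.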

Code evidence from `haystack/components/preprocessors/document_splitter.py:54-66`:

def __init__(
    self,
    split_by: Literal["function", "page", "passage", "period", "word", "line", "sentence"] = "word",
    split_length: int = 200,
    split_overlap: int = 0,
    split_threshold: int = 0,
    splitting_function: Callable[[str], list[str]] | None = None,
    respect_sentence_boundary: bool = False,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True,
    *,
    skip_empty_documents: bool = True,
):

Skip empty documents guidance from `document_splitter.py:92-94`:

:param skip_empty_documents: Choose whether to skip documents with empty content. Default is True.
    Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
    from non-textual documents.
