
Principle:Deepset ai Haystack Document Splitting


Overview

Document Splitting is the principle of dividing long documents into smaller, semantically meaningful chunks. This is a critical preprocessing step in retrieval-augmented generation (RAG) and search pipelines. Splitting enables embedding models to create focused semantic representations for each chunk and prevents exceeding the context window limits of language models.

Description

Modern embedding models and language models operate within fixed token limits. A document that exceeds these limits must be split into smaller pieces, each of which can be independently embedded, indexed, and retrieved. The challenge is to split documents in a way that preserves meaningful context within each chunk while keeping chunks small enough for model consumption.

Document Splitting supports multiple chunking strategies, each suited to different document structures:

Splitting Units

  • Word (split_by="word"): Splits on whitespace boundaries. The most common approach for general text.
  • Sentence (split_by="sentence"): Uses NLTK sentence tokenization to split at sentence boundaries. Preserves complete sentences within each chunk.
  • Passage (split_by="passage"): Splits on double newlines (\n\n), treating paragraphs as natural units.
  • Page (split_by="page"): Splits on form feed characters (\f), preserving page structure from the original document.
  • Period (split_by="period"): Splits on periods, useful for simple sentence-like segmentation.
  • Line (split_by="line"): Splits on single newlines.
  • Function (split_by="function"): Uses a user-supplied function for custom splitting logic.
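The behavior of these unit options can be illustrated with a minimal plain-Python sketch. This is not the Haystack implementation; the function name and structure here are illustrative only, but the delimiters mirror the ones described above:

```python
def split_into_units(text: str, split_by: str) -> list[str]:
    """Break text into atomic units before grouping them into chunks."""
    if split_by == "word":
        return text.split()              # whitespace boundaries
    if split_by == "passage":
        return text.split("\n\n")        # double newline = paragraph
    if split_by == "page":
        return text.split("\f")          # form feed = page break
    if split_by == "line":
        return text.split("\n")          # single newline
    if split_by == "period":
        return [s for s in text.split(".") if s]
    raise ValueError(f"unsupported split_by: {split_by}")

text = "First paragraph.\n\nSecond paragraph."
print(split_into_units(text, "passage"))
# ['First paragraph.', 'Second paragraph.']
```

A real splitter then groups consecutive units into chunks of a configured length; the sketches in the following sections build on this idea.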

Overlap

Overlap is the number of units shared between consecutive chunks. When a document is split into chunks A and B, the last N units of A also appear at the beginning of B. This serves several purposes:

  • Context continuity: Information at chunk boundaries is not lost, since it appears in both adjacent chunks.
  • Retrieval robustness: A query that matches content near a chunk boundary has a higher chance of retrieving a relevant chunk.
  • Cross-reference tracking: Overlap metadata records which chunks share content, enabling downstream components to deduplicate or merge results.
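A sliding-window sketch shows how overlap works in practice (again an illustration, not the Haystack code): each chunk starts `length - overlap` units after the previous one, so the last `overlap` units of one chunk reappear at the start of the next.

```python
def chunk_with_overlap(units: list[str], length: int, overlap: int) -> list[list[str]]:
    """Group units into chunks of `length`, repeating the last `overlap` units."""
    if overlap >= length:
        raise ValueError("overlap must be smaller than chunk length")
    step = length - overlap          # how far the window advances each time
    return [units[i:i + length] for i in range(0, len(units), step)]

words = "one two three four five six".split()
for chunk in chunk_with_overlap(words, length=4, overlap=2):
    print(chunk)
# ['one', 'two', 'three', 'four']
# ['three', 'four', 'five', 'six']
# ['five', 'six']
```

Note the small trailing chunk: this is exactly the case the split threshold (next section) is designed to handle.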

Split Threshold

The split threshold prevents the creation of excessively small trailing chunks. If the final chunk would contain fewer units than the threshold, it is appended to the previous chunk instead of becoming a standalone split. This avoids tiny, low-context chunks that would produce poor embeddings.
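The merge step can be sketched as follows. This simplified version ignores the interaction with overlap (where shared units would need deduplicating before merging), which a production implementation must handle:

```python
def apply_split_threshold(chunks: list[list[str]], threshold: int) -> list[list[str]]:
    """Merge a final chunk into the previous one if it is below the threshold."""
    if len(chunks) > 1 and len(chunks[-1]) < threshold:
        merged = chunks[:-2] + [chunks[-2] + chunks[-1]]
        return merged
    return chunks

chunks = [["a", "b", "c"], ["d"]]
print(apply_split_threshold(chunks, threshold=2))
# [['a', 'b', 'c', 'd']]
```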

Sentence Boundary Respect

When splitting by word count, the respect_sentence_boundary option ensures that splits occur only between sentences rather than in the middle of a sentence. This uses NLTK sentence tokenization to group sentences into chunks that approach (but do not exceed) the target word count.
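The greedy grouping strategy can be sketched as below. For simplicity this sketch takes pre-tokenized sentences as input rather than running NLTK, and the function name is illustrative:

```python
def split_respecting_sentences(sentences: list[str], max_words: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Flush the current chunk if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

sents = ["The cat sat.", "It purred loudly.", "Then it slept all afternoon."]
print(split_respecting_sentences(sents, max_words=7))
# ['The cat sat. It purred loudly.', 'Then it slept all afternoon.']
```

A sentence longer than the budget still becomes its own chunk, which is why chunks "approach but do not exceed" the target only when individual sentences fit within it.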

Usage

Document Splitting is used after document cleaning and before embedding in the indexing pipeline. It is one of the most impactful preprocessing steps for retrieval quality.

[DocumentCleaner] --> [DocumentSplitter] --> [DocumentEmbedder] --> [DocumentWriter/Store]

Theoretical Basis

Document splitting relates to the broader concept of text segmentation in NLP. The choice of chunking strategy directly impacts retrieval quality:

  • Too large: Chunks contain too much irrelevant context, diluting the semantic signal in embeddings. They may also exceed model token limits.
  • Too small: Chunks lack sufficient context for meaningful semantic representation, leading to poor retrieval precision.
  • Optimal: Chunks contain a single coherent topic or idea, producing focused embeddings that match well with relevant queries.

The overlap technique is analogous to the sliding window approach used in signal processing and sequence modeling, where overlapping windows ensure no information is lost at boundaries.

From an information-theoretic perspective, content at chunk boundaries is particularly vulnerable to loss during segmentation: a passage whose setup falls in one chunk and whose conclusion falls in the next is poorly represented by either chunk alone. Overlap mitigates this by providing redundant coverage of boundary content.

The split threshold mechanism implements a form of minimum viable chunk size, below which a chunk is considered too small to carry meaningful semantic content on its own.

Related Pages

Implemented By

Uses Heuristic
