Principle: Deepset AI Haystack Document Splitting
Overview
Document Splitting is the principle of dividing long documents into smaller, semantically meaningful chunks. This is a critical preprocessing step in retrieval-augmented generation (RAG) and search pipelines. Splitting enables embedding models to create focused semantic representations for each chunk and prevents exceeding the context window limits of language models.
Description
Modern embedding models and language models operate within fixed token limits. A document that exceeds these limits must be split into smaller pieces, each of which can be independently embedded, indexed, and retrieved. The challenge is to split documents in a way that preserves meaningful context within each chunk while keeping chunks small enough for model consumption.
Document Splitting supports multiple chunking strategies, each suited to different document structures:
Splitting Units
- Word (split_by="word"): Splits on whitespace boundaries. The most common approach for general text.
- Sentence (split_by="sentence"): Uses NLTK sentence tokenization to split at sentence boundaries. Preserves complete sentences within each chunk.
- Passage (split_by="passage"): Splits on double newlines (\n\n), treating paragraphs as natural units.
- Page (split_by="page"): Splits on form feed characters (\f), preserving page structure from the original document.
- Period (split_by="period"): Splits on periods, useful for simple sentence-like segmentation.
- Line (split_by="line"): Splits on single newlines.
- Function (split_by="function"): Uses a user-supplied function for custom splitting logic.
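The delimiter-based units above can be illustrated with a minimal sketch in plain Python. This is not the Haystack implementation; the SPLIT_FUNCTIONS table and split_units helper are hypothetical names used only to show how each split_by value maps to a segmentation rule.

```python
# Illustrative sketch of the splitting units (not Haystack's actual code):
# each split_by value corresponds to a simple delimiter-based segmentation.
SPLIT_FUNCTIONS = {
    "word": lambda text: text.split(),            # whitespace boundaries
    "passage": lambda text: text.split("\n\n"),   # double newline = paragraph
    "page": lambda text: text.split("\f"),        # form feed = page break
    "period": lambda text: [s.strip() for s in text.split(".") if s.strip()],
    "line": lambda text: text.splitlines(),       # single newlines
}

def split_units(text: str, split_by: str) -> list[str]:
    """Break raw text into units for the chosen strategy."""
    return SPLIT_FUNCTIONS[split_by](text)

text = "First paragraph.\n\nSecond paragraph."
print(split_units(text, "passage"))
```

Sentence and function splitting are omitted here because they require an external tokenizer (NLTK) or a user-supplied callable rather than a fixed delimiter.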
Overlap
Overlap is the number of units shared between consecutive chunks. When a document is split into chunks A and B, the last N units of A also appear at the beginning of B. This serves several purposes:
- Context continuity: Information at chunk boundaries is not lost, since it appears in both adjacent chunks.
- Retrieval robustness: A query that matches content near a chunk boundary has a higher chance of retrieving a relevant chunk.
- Cross-reference tracking: Overlap metadata records which chunks share content, enabling downstream components to deduplicate or merge results.
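The overlap mechanism described above amounts to a sliding window over the unit list. Below is a hedged sketch (the function name chunk_with_overlap is hypothetical, not a Haystack API): each chunk repeats the last split_overlap units of its predecessor.

```python
def chunk_with_overlap(units, split_length, split_overlap):
    """Group units into chunks of at most split_length units,
    sharing split_overlap units between consecutive chunks."""
    if split_overlap >= split_length:
        raise ValueError("overlap must be smaller than chunk length")
    step = split_length - split_overlap  # how far the window advances
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + split_length])
        if start + split_length >= len(units):
            break  # the final window already covers the tail
    return chunks

words = "the quick brown fox jumps over the lazy dog".split()
# With split_length=4 and split_overlap=1, each chunk starts with the
# last word of the previous chunk.
print(chunk_with_overlap(words, split_length=4, split_overlap=1))
```

Note how the window advances by split_length - split_overlap units per step, which is exactly what makes boundary content appear in two adjacent chunks.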
Split Threshold
The split threshold prevents the creation of excessively small trailing chunks. If the final chunk would contain fewer units than the threshold, it is appended to the previous chunk instead of becoming a standalone split. This avoids tiny, low-context chunks that would produce poor embeddings.
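The threshold rule can be sketched as a small post-processing step. This is an illustrative assumption of the behavior described above, not Haystack's internal code; apply_split_threshold is a hypothetical helper name.

```python
def apply_split_threshold(chunks, split_threshold):
    """If the final chunk has fewer units than split_threshold,
    fold it into the previous chunk instead of keeping it standalone."""
    if len(chunks) > 1 and len(chunks[-1]) < split_threshold:
        chunks[-2].extend(chunks.pop())  # merge the short tail backwards
    return chunks

chunks = [["a"] * 10, ["b"] * 10, ["c"] * 2]  # trailing chunk of only 2 units
merged = apply_split_threshold(chunks, split_threshold=5)
print([len(c) for c in merged])  # the 2-unit tail joins the previous chunk
```

The effect is that the last chunk is always either standalone and above the threshold, or absorbed into its neighbor, so no chunk is too small to embed meaningfully.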
Sentence Boundary Respect
When splitting by word count, the respect_sentence_boundary option ensures that splits occur only between sentences rather than in the middle of one. It uses NLTK sentence tokenization to group sentences into chunks that approach (but do not exceed) the target word count.
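The grouping logic can be sketched as a greedy packing of whole sentences under a word budget. This is a simplified illustration, assuming sentence tokenization has already been done (in Haystack it would come from NLTK); group_sentences is a hypothetical name.

```python
def group_sentences(sentences, max_words):
    """Greedily pack whole sentences into chunks whose word count
    approaches but does not exceed max_words. A single sentence longer
    than max_words still gets its own chunk (it is never cut)."""
    chunks, current, current_words = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_words + n > max_words:
            chunks.append(" ".join(current))  # close the chunk at a boundary
            current, current_words = [], 0
        current.append(sentence)
        current_words += n
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = ["One two three.", "Four five.", "Six seven eight nine."]
print(group_sentences(sentences, max_words=5))
```

The trade-off is that chunks are slightly uneven in length, in exchange for never producing a chunk that starts or ends mid-sentence.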
Usage
Document Splitting is used after document cleaning and before embedding in the indexing pipeline. It is one of the most impactful preprocessing steps for retrieval quality.
[DocumentCleaner] --> [DocumentSplitter] --> [DocumentEmbedder] --> [DocumentWriter/Store]
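The pipeline above can be sketched with plain-Python stand-ins for each stage. These functions are hypothetical toys, not the Haystack components they are named after; they only show how cleaning, splitting, and embedding compose in the indexing flow.

```python
def clean(docs):
    """Stand-in for DocumentCleaner: normalize whitespace."""
    return [" ".join(d.split()) for d in docs]

def split(docs, split_length=5):
    """Stand-in for DocumentSplitter: fixed-size word chunks, no overlap."""
    chunks = []
    for d in docs:
        words = d.split()
        for i in range(0, len(words), split_length):
            chunks.append(" ".join(words[i:i + split_length]))
    return chunks

def embed(chunks):
    """Stand-in for DocumentEmbedder: pair each chunk with a toy vector."""
    return [(c, [float(len(c))]) for c in chunks]  # 1-d length "embedding"

# Compose the stages in indexing order; the result would go to a store.
store = embed(split(clean(["Some   long  document text that needs splitting"])))
print([chunk for chunk, _ in store])
```

In a real Haystack pipeline each stage is a component connected in a Pipeline object; the point here is only the ordering: splitting happens after cleaning and before embedding.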
Theoretical Basis
Document splitting relates to the broader concept of text segmentation in NLP. The choice of chunking strategy directly impacts retrieval quality:
- Too large: Chunks contain too much irrelevant context, diluting the semantic signal in embeddings. They may also exceed model token limits.
- Too small: Chunks lack sufficient context for meaningful semantic representation, leading to poor retrieval precision.
- Optimal: Chunks contain a single coherent topic or idea, producing focused embeddings that match well with relevant queries.
The overlap technique is analogous to the sliding window approach used in signal processing and sequence modeling, where overlapping windows ensure no information is lost at boundaries.
From an information-theoretic standpoint, content at chunk boundaries is particularly vulnerable to loss during segmentation, since it is separated from the context that gives it meaning. Overlap mitigates this by providing redundant coverage of boundary content.
The split threshold mechanism implements a form of minimum viable chunk size, below which a chunk is considered too small to carry meaningful semantic content on its own.
Related Pages
- Deepset_ai_Haystack_DocumentSplitter - Implementation of Document Splitting in Haystack
- Deepset_ai_Haystack_Document_Cleaning - Cleaning documents before splitting
- Deepset_ai_Haystack_Document_Joining - Joining document streams after retrieval
- Deepset_ai_Haystack_Text_File_Conversion - Converting text files before splitting