Implementation: Deepset ai Haystack DocumentSplitter
Overview
DocumentSplitter is a Haystack component that splits long documents into smaller chunks for embedding and retrieval. It supports multiple splitting strategies (word, sentence, passage, page, period, line, or a custom function), configurable overlap between consecutive chunks, and a split threshold that prevents tiny trailing chunks. Splitting is essential for preparing documents for vector embedding and retrieval within embedding-model token limits.
Code Reference
Source file: haystack/components/preprocessors/document_splitter.py, lines 22-206
Import:
from haystack.components.preprocessors import DocumentSplitter
Dependencies: more_itertools
Constructor
DocumentSplitter(
    split_by: Literal["function", "page", "passage", "period", "word", "line", "sentence"] = "word",
    split_length: int = 200,
    split_overlap: int = 0,
    split_threshold: int = 0,
    splitting_function: Callable[[str], list[str]] | None = None,
    respect_sentence_boundary: bool = False,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True,
    *,
    skip_empty_documents: bool = True
)
Parameters:
- split_by (str, default "word"): The unit for splitting. Options: "word" (split by spaces), "sentence" (split by the NLTK sentence tokenizer), "passage" (split by double newlines, \n\n), "page" (split by form feed characters, \f), "period" (split by periods), "line" (split by single newlines, \n), "function" (use a custom splitting function).
- split_length (int, default 200): Maximum number of units per split. Must be greater than 0.
- split_overlap (int, default 0): Number of overlapping units between consecutive splits. Must be non-negative.
- split_threshold (int, default 0): Minimum units per split. If the final split has fewer units, it is merged with the previous split.
- splitting_function (Callable | None, default None): Required when split_by="function". A function that accepts a string and returns a list of strings.
- respect_sentence_boundary (bool, default False): When splitting by word, ensures splits occur between sentences rather than mid-sentence. Only applies when split_by="word".
- language (str, default "en"): Language for the NLTK sentence tokenizer.
- use_split_rules (bool, default True): Whether to apply additional split rules for sentence splitting.
- extend_abbreviations (bool, default True): Whether to extend NLTK abbreviation lists. Supported for English and German.
- skip_empty_documents (bool, default True): Whether to skip documents with empty content.
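To make the interaction of split_length, split_overlap, and split_threshold concrete, here is a pure-Python sketch of the windowing arithmetic. This is a hypothetical illustration, not Haystack's actual implementation: it assumes the text has already been tokenized into units (words, sentences, etc.).

```python
# Hypothetical sketch (not Haystack's code) of how split_length,
# split_overlap, and split_threshold interact on a list of units.
def chunk(units, split_length, split_overlap=0, split_threshold=0):
    step = split_length - split_overlap
    # Sliding windows of split_length units, advancing by length - overlap.
    chunks = [units[i:i + split_length] for i in range(0, len(units), step)]
    # Discard a trailing window that only repeats overlapped units.
    while len(chunks) > 1 and len(chunks[-1]) <= split_overlap:
        chunks.pop()
    # split_threshold: fold a too-small final chunk into the previous one.
    if len(chunks) > 1 and len(chunks[-1]) < split_threshold:
        tail = chunks.pop()
        chunks[-1] += tail[split_overlap:]
    return chunks

words = "one two three four five six seven eight".split()
chunk(words, split_length=3, split_overlap=1)
# -> [one two three] [three four five] [five six seven] [seven eight]
```

With split_threshold=3, the final two-unit window above would instead be merged into the previous chunk rather than emitted on its own.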
Run Method
run(documents: list[Document]) -> {"documents": list[Document]}
Parameters:
documents(list[Document], required): The documents to split.
Raises:
- TypeError: If the input is not a list of Document objects.
- ValueError: If a document's content is None.
Output Document Metadata
Each output document includes the following metadata fields:
- source_id: The ID of the original document.
- page_number: The page number from the original document.
- split_id: The sequential index of this split.
- split_idx_start: The character index where this split starts in the original text.
- _split_overlap: (When overlap > 0) Information about overlapping content with adjacent splits.
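The split_idx_start field lets you locate every chunk in its source text. The following pure-Python illustration (with made-up chunk values, not Haystack output) shows the contract that offset satisfies:

```python
# Pure-Python illustration of the split_idx_start contract:
# each chunk begins at that character offset in the source text.
text = "Moonlight shimmered softly, wolves howled nearby."
chunks = ["Moonlight shimmered ", "softly, wolves ", "howled nearby."]

offset = 0
meta = []
for c in chunks:
    meta.append({"split_id": len(meta), "split_idx_start": offset})
    offset += len(c)

for c, m in zip(chunks, meta):
    start = m["split_idx_start"]
    assert text[start:start + len(c)] == c  # chunk found at its offset
```

This property makes it possible to highlight retrieved chunks in the original document or stitch adjacent chunks back together.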
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | documents | list[Document] | Documents with text content to split |
| Output | documents | list[Document] | Split document chunks with metadata tracking original position |
Usage Examples
Basic Word Splitting
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")
splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
# Produces 3 chunks of 3 words each
Sentence Splitting with Overlap
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
doc = Document(content="First sentence. Second sentence. Third sentence. Fourth sentence.")
splitter = DocumentSplitter(split_by="sentence", split_length=2, split_overlap=1)
splitter.warm_up() # Loads NLTK sentence tokenizer
result = splitter.run(documents=[doc])
# Chunk 1: "First sentence. Second sentence."
# Chunk 2: "Second sentence. Third sentence."
# Chunk 3: "Third sentence. Fourth sentence."
Page-Based Splitting
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
# Form feed characters separate pages (as produced by PDF converters)
doc = Document(content="Page 1 content\fPage 2 content\fPage 3 content")
splitter = DocumentSplitter(split_by="page", split_length=1)
result = splitter.run(documents=[doc])
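The page_number metadata follows from the form-feed positions. A rough pure-Python illustration of that bookkeeping (an assumed model, not Haystack's actual code):

```python
# Assumed bookkeeping: a chunk's page_number is 1 plus the number of
# form feed characters that precede its start in the original text.
text = "Page 1 content\fPage 2 content\fPage 3 content"

starts = [0]
for i, ch in enumerate(text):
    if ch == "\f":
        starts.append(i + 1)  # each new page starts right after a \f

pages = text.split("\f")
page_numbers = [1 + text[:s].count("\f") for s in starts]
```

So the three pages above carry page numbers 1, 2, and 3 in their metadata.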
Word Splitting with Sentence Boundary Respect
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
doc = Document(content="This is a long document. It has many sentences. Each one matters.")
splitter = DocumentSplitter(
    split_by="word",
    split_length=10,
    split_overlap=2,
    respect_sentence_boundary=True
)
splitter.warm_up()
result = splitter.run(documents=[doc])
# Splits at sentence boundaries, not mid-sentence
Custom Splitting Function
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
import re

def split_by_markdown_headers(text: str) -> list[str]:
    # Split at newlines immediately followed by a top-level header ("# ")
    sections = re.split(r'\n(?=# )', text)
    return [s for s in sections if s.strip()]

splitter = DocumentSplitter(split_by="function", splitting_function=split_by_markdown_headers)
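Run standalone on some sample markdown (the sample text here is made up for illustration), the regex splits before each top-level header:

```python
import re

# Same splitting logic as the custom function above, shown standalone.
def split_by_markdown_headers(text: str) -> list[str]:
    # Split at newlines that are immediately followed by "# " (a header)
    sections = re.split(r'\n(?=# )', text)
    return [s for s in sections if s.strip()]

md = "# Intro\nHello.\n# Usage\nRun it.\n# FAQ\nWhy?"
sections = split_by_markdown_headers(md)
# sections[0] == "# Intro\nHello."
```

Each section keeps its header line, so downstream chunks remain self-describing.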
Pipeline Integration
from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.converters import TextFileToDocument
pipeline = Pipeline()
pipeline.add_component("converter", TextFileToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_threshold=10
))
pipeline.connect("converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")
# Run with a path to a text file, e.g.:
# result = pipeline.run({"converter": {"sources": ["document.txt"]}})
Related Pages
Implements Principle
- Deepset_ai_Haystack_Document_Splitting - The principle behind document splitting and chunking strategies
- Deepset_ai_Haystack_DocumentCleaner - Cleans documents before splitting
- Deepset_ai_Haystack_DocumentJoiner - Joins document streams after retrieval
- Deepset_ai_Haystack_TextFileToDocument - Converts text files to documents