
Implementation:Deepset ai Haystack DocumentSplitter

From Leeroopedia

Overview

DocumentSplitter is a Haystack component that splits long documents into smaller chunks for embedding and retrieval. It supports multiple splitting strategies (word, sentence, passage, page, period, line, and custom function), configurable overlap between chunks, and a split threshold to prevent tiny trailing chunks. This component is essential for preparing documents for vector embeddings and retrieval within token limit constraints.

Code Reference

Source file: haystack/components/preprocessors/document_splitter.py, lines 22-206

Import:

from haystack.components.preprocessors import DocumentSplitter

Dependencies: more_itertools

Constructor

DocumentSplitter(
    split_by: Literal["function", "page", "passage", "period", "word", "line", "sentence"] = "word",
    split_length: int = 200,
    split_overlap: int = 0,
    split_threshold: int = 0,
    splitting_function: Callable[[str], list[str]] | None = None,
    respect_sentence_boundary: bool = False,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True,
    *,
    skip_empty_documents: bool = True
)

Parameters:

  • split_by (str, default "word"): The unit for splitting. Options:
    • "word": Split by spaces.
    • "sentence": Split by NLTK sentence tokenizer.
    • "passage": Split by double newlines (\n\n).
    • "page": Split by form feed characters (\f).
    • "period": Split by periods.
    • "line": Split by single newlines (\n).
    • "function": Use a custom splitting function.
  • split_length (int, default 200): Maximum number of units per split. Must be greater than 0.
  • split_overlap (int, default 0): Number of overlapping units between consecutive splits. Must be non-negative.
  • split_threshold (int, default 0): Minimum units per split. If the final split has fewer units, it is merged with the previous split.
  • splitting_function (Callable | None, default None): Required when split_by="function". A function that accepts a string and returns a list of strings.
  • respect_sentence_boundary (bool, default False): When splitting by word, ensures splits occur between sentences rather than mid-sentence. Only applies when split_by="word".
  • language (Language, default "en"): Language used by the NLTK sentence tokenizer (e.g., "en", "de").
  • use_split_rules (bool, default True): Whether to use additional split rules for sentence splitting.
  • extend_abbreviations (bool, default True): Whether to extend NLTK abbreviation lists. Supported for English and German.
  • skip_empty_documents (bool, default True): Whether to skip documents with empty content.
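
The interplay of split_length, split_overlap, and split_threshold can be sketched in plain Python. This is an illustrative re-implementation over pre-split units, not Haystack's actual code; chunk_units is a hypothetical helper:

```python
def chunk_units(units: list[str], split_length: int, split_overlap: int = 0,
                split_threshold: int = 0) -> list[list[str]]:
    """Illustrative chunking over pre-split units (words, sentences, ...)."""
    step = split_length - split_overlap
    chunks = [units[i:i + split_length] for i in range(0, len(units), step)]
    # A trailing chunk made up entirely of overlap adds no new content
    if len(chunks) > 1 and len(chunks[-1]) <= split_overlap:
        chunks.pop()
    # split_threshold: merge a too-small final chunk into the previous one
    if len(chunks) > 1 and len(chunks[-1]) < split_threshold:
        chunks[-2].extend(chunks.pop())
    return chunks

words = "one two three four five six seven".split()
chunk_units(words, split_length=3)
# → [['one', 'two', 'three'], ['four', 'five', 'six'], ['seven']]
chunk_units(words, split_length=3, split_threshold=2)
# → [['one', 'two', 'three'], ['four', 'five', 'six', 'seven']]
```

With split_overlap=1 the same seven units yield three chunks whose adjacent boundaries share one unit each.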

Run Method

run(documents: list[Document]) -> {"documents": list[Document]}

Parameters:

  • documents (list[Document], required): The documents to split.

Raises:

  • TypeError: If input is not a list of Document objects.
  • ValueError: If a document's content is None.

Output Document Metadata

Each output document includes the following metadata fields:

  • source_id: The ID of the original document.
  • page_number: The page number (counted from form feed characters) where this split starts in the original document.
  • split_id: The sequential index of this split.
  • split_idx_start: The character index where this split starts in the original text.
  • _split_overlap: (When overlap > 0) Information about overlapping content with adjacent splits.
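
This positional metadata makes every chunk traceable back to its source. A minimal sketch of that guarantee (the dict below stands in for a real Document, and the offsets are hand-written for illustration, not Haystack output):

```python
# Illustrative: split_idx_start is a character offset into the original content,
# so original[start:start + len(chunk)] recovers the chunk text exactly.
original = "First sentence. Second sentence. Third sentence."
chunk = {"content": "Second sentence. ",
         "meta": {"source_id": "doc-1", "split_id": 1, "split_idx_start": 16}}

start = chunk["meta"]["split_idx_start"]
end = start + len(chunk["content"])
assert original[start:end] == chunk["content"]
```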

I/O Contract

Direction | Name      | Type           | Description
Input     | documents | list[Document] | Documents with text content to split
Output    | documents | list[Document] | Split document chunks with metadata tracking original position

Usage Examples

Basic Word Splitting

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
# Produces multiple chunks of 3 words each

Sentence Splitting with Overlap

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="First sentence. Second sentence. Third sentence. Fourth sentence.")

splitter = DocumentSplitter(split_by="sentence", split_length=2, split_overlap=1)
splitter.warm_up()  # Loads NLTK sentence tokenizer
result = splitter.run(documents=[doc])
# Chunk 1: "First sentence. Second sentence."
# Chunk 2: "Second sentence. Third sentence."
# Chunk 3: "Third sentence. Fourth sentence."

Page-Based Splitting

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

# Form feed characters separate pages (as produced by PDF converters)
doc = Document(content="Page 1 content\fPage 2 content\fPage 3 content")

splitter = DocumentSplitter(split_by="page", split_length=1)
result = splitter.run(documents=[doc])
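
A chunk's page number can be derived by counting the form feeds that precede it. As a sketch of that idea (illustrative logic with a hypothetical helper, not the component's source):

```python
# Illustrative: page_number for a position = 1 + form feeds before that position.
text = "Page 1 content\fPage 2 content\fPage 3 content"

def page_number_at(text: str, idx: int) -> int:
    return text.count("\f", 0, idx) + 1

page_number_at(text, 0)                      # → 1
page_number_at(text, text.index("Page 3"))   # → 3
```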

Word Splitting with Sentence Boundary Respect

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="This is a long document. It has many sentences. Each one matters.")

splitter = DocumentSplitter(
    split_by="word",
    split_length=10,
    split_overlap=2,
    respect_sentence_boundary=True
)
splitter.warm_up()
result = splitter.run(documents=[doc])
# Splits at sentence boundaries, not mid-sentence

Custom Splitting Function

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
import re

def split_by_markdown_headers(text: str) -> list[str]:
    # Split before each top-level Markdown header, keeping headers with their sections
    sections = re.split(r'\n(?=# )', text)
    return [s for s in sections if s.strip()]

splitter = DocumentSplitter(split_by="function", splitting_function=split_by_markdown_headers)
doc = Document(content="# Intro\nSome text.\n# Details\nMore text.")
result = splitter.run(documents=[doc])

Pipeline Integration

from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.converters import TextFileToDocument

pipeline = Pipeline()
pipeline.add_component("converter", TextFileToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_threshold=10
))

pipeline.connect("converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")

result = pipeline.run({"converter": {"sources": ["docs/guide.txt"]}})  # path is illustrative
chunks = result["splitter"]["documents"]

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
