
Implementation: deepset-ai Haystack DocumentCleaner

From Leeroopedia


Overview

DocumentCleaner is a Haystack component that cleans text content in Document objects by removing extra whitespace, empty lines, specified substrings, regex-matched patterns, and repeated headers and footers. It applies a configurable sequence of text transformations to improve the quality of documents for downstream processing.

Code Reference

Source file: haystack/components/preprocessors/document_cleaner.py, lines 18-159

Import:

from haystack.components.preprocessors import DocumentCleaner

Constructor

DocumentCleaner(
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_repeated_substrings: bool = False,
    keep_id: bool = False,
    remove_substrings: list[str] | None = None,
    remove_regex: str | None = None,
    unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"] | None = None,
    ascii_only: bool = False,
    strip_whitespaces: bool = False,
    replace_regexes: dict[str, str] | None = None
)

Parameters:

  • remove_empty_lines (bool, default True): If True, removes lines containing only whitespace.
  • remove_extra_whitespaces (bool, default True): If True, collapses multiple consecutive whitespace characters into a single space.
  • remove_repeated_substrings (bool, default False): If True, removes text repeated across pages (headers/footers). Requires pages separated by form feed characters (\f).
  • keep_id (bool, default False): If True, preserves the original document ID. If False, a new ID is generated.
  • remove_substrings (list[str] | None, default None): A list of exact substrings to remove from the text.
  • remove_regex (str | None, default None): A regex pattern; all matches are removed (replaced with empty string).
  • unicode_normalization (str | None, default None): Unicode normalization form to apply. Options: "NFC", "NFKC", "NFD", "NFKD". Applied before all other steps.
  • ascii_only (bool, default False): If True, converts text to ASCII by removing accents and non-ASCII characters. Applied before pattern matching.
  • strip_whitespaces (bool, default False): If True, removes leading and trailing whitespace from the entire document content.
  • replace_regexes (dict[str, str] | None, default None): A dictionary mapping regex patterns to replacement strings. Applied after remove_regex.
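The interplay of these options can be approximated in plain Python. The sketch below mirrors the documented ordering (normalization before everything else, replace_regexes after remove_regex) but is only an illustration of the behavior, not the component's actual implementation; the regex patterns and the foo/bar replacement are made-up values.

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Rough, illustrative approximation of DocumentCleaner's steps."""
    # unicode_normalization: applied before all other steps
    text = unicodedata.normalize("NFKC", text)
    # remove_regex: every match is replaced with an empty string
    text = re.sub(r"Page \d+ of \d+", "", text)
    # replace_regexes: pattern -> replacement pairs, applied after remove_regex
    for pattern, repl in {r"\bfoo\b": "bar"}.items():
        text = re.sub(pattern, repl, text)
    # remove_extra_whitespaces: collapse runs of spaces/tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    # remove_empty_lines: drop lines that are empty or whitespace-only
    text = "\n".join(line for line in text.split("\n") if line.strip())
    # strip_whitespaces: trim the whole document
    return text.strip()

print(clean_text("Page 1 of 10\nfoo   here\n\n\nend"))
```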

Run Method

run(documents: list[Document]) -> {"documents": list[Document]}

Parameters:

  • documents (list[Document], required): The list of documents to clean.

Raises:

  • TypeError: If input is not a list of Document objects.

I/O Contract

Direction | Name      | Type           | Description
Input     | documents | list[Document] | Documents with text content to clean
Output    | documents | list[Document] | Cleaned documents with transformed text content

Usage Examples

Basic Cleaning

from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This   is  a  document  to  clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

# Note the trailing space left where the substring was removed:
assert result["documents"][0].content == "This is a document to clean "

Regex-Based Cleaning

from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="Page 1 of 10\nActual content here\nPage 2 of 10")

cleaner = DocumentCleaner(
    remove_regex=r"Page \d+ of \d+",
    remove_empty_lines=True
)
result = cleaner.run(documents=[doc])

Unicode Normalization and ASCII Conversion

from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="Cafe\u0301 Resum\u00e9")

cleaner = DocumentCleaner(
    unicode_normalization="NFC",
    ascii_only=True
)
result = cleaner.run(documents=[doc])
# Content becomes ASCII-only with accents removed
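The ascii_only option behaves roughly like the common decompose-and-drop idiom shown below; this is a sketch of the idea using the standard library, not the component's exact code:

```python
import unicodedata

def to_ascii(text: str) -> str:
    # Decompose accented characters (NFKD), then encode to ASCII while
    # ignoring unencodable bytes, which drops the combining accent marks.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

print(to_ascii("Café Resumé"))  # Cafe Resume
```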

Removing Repeated Headers and Footers

from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

# Pages separated by form feed characters
doc = Document(content="Header Text\nPage 1 content\f"
                       "Header Text\nPage 2 content\f"
                       "Header Text\nPage 3 content")

cleaner = DocumentCleaner(remove_repeated_substrings=True)
result = cleaner.run(documents=[doc])
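Conceptually, this step looks for text that recurs across the \f-separated pages. The simplified sketch below strips a first line shared by every page; the real component uses a longest-common n-gram approach and also handles footers, so this is only an illustration:

```python
def strip_repeated_headers(text: str) -> str:
    pages = text.split("\f")
    if len(pages) < 2:
        return text
    first_lines = {page.split("\n", 1)[0] for page in pages}
    # Only strip when every page starts with the identical line.
    if len(first_lines) == 1:
        header = first_lines.pop()
        pages = [page[len(header):].lstrip("\n") for page in pages]
    return "\f".join(pages)

print(strip_repeated_headers("Header Text\nPage 1 content\fHeader Text\nPage 2 content"))
```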

Pipeline Integration

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

pipeline = Pipeline()
pipeline.add_component("converter", TextFileToDocument())
pipeline.add_component("cleaner", DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    remove_repeated_substrings=True
))
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))

pipeline.connect("converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")

