Implementation: Deepset AI Haystack DocumentCleaner
Overview
DocumentCleaner is a Haystack component that cleans text content in Document objects by removing extra whitespace, empty lines, specified substrings, regex-matched patterns, and repeated headers and footers. It applies a configurable sequence of text transformations to improve the quality of documents for downstream processing.
Code Reference
Source file: haystack/components/preprocessors/document_cleaner.py, lines 18-159
Import:
```python
from haystack.components.preprocessors import DocumentCleaner
```
Constructor
```python
DocumentCleaner(
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_repeated_substrings: bool = False,
    keep_id: bool = False,
    remove_substrings: list[str] | None = None,
    remove_regex: str | None = None,
    unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"] | None = None,
    ascii_only: bool = False,
    strip_whitespaces: bool = False,
    replace_regexes: dict[str, str] | None = None
)
```
Parameters:
- remove_empty_lines (bool, default True): If True, removes lines containing only whitespace.
- remove_extra_whitespaces (bool, default True): If True, collapses multiple consecutive whitespace characters into a single space.
- remove_repeated_substrings (bool, default False): If True, removes text repeated across pages (headers/footers). Requires pages separated by form feed characters (\f).
- keep_id (bool, default False): If True, preserves the original document ID. If False, a new ID is generated.
- remove_substrings (list[str] | None, default None): A list of exact substrings to remove from the text.
- remove_regex (str | None, default None): A regex pattern; all matches are removed (replaced with the empty string).
- unicode_normalization (str | None, default None): Unicode normalization form to apply. Options: "NFC", "NFKC", "NFD", "NFKD". Applied before all other steps.
- ascii_only (bool, default False): If True, converts text to ASCII by removing accents and dropping non-ASCII characters. Applied before pattern matching.
- strip_whitespaces (bool, default False): If True, removes leading and trailing whitespace from the entire document content.
- replace_regexes (dict[str, str] | None, default None): A dictionary mapping regex patterns to replacement strings. Applied after remove_regex.
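The ordering noted above matters: normalization and ASCII conversion run before pattern matching, and whitespace handling comes at the end. The sequence can be sketched with the standard library; the function below is illustrative only, not the actual implementation, and the exact regex used for whitespace collapsing is an assumption.

```python
import re
import unicodedata

def clean_text(
    text,
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    remove_substrings=None,
    remove_regex=None,
    unicode_normalization=None,
    ascii_only=False,
    strip_whitespaces=False,
    replace_regexes=None,
):
    # 1. Normalization steps run before any pattern matching.
    if unicode_normalization:
        text = unicodedata.normalize(unicode_normalization, text)
    if ascii_only:
        # Assumed behavior: decompose accents (NFKD), then drop any
        # code point outside the ASCII range.
        text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # 2. Pattern-based removal and replacement (replace_regexes after remove_regex).
    if remove_regex:
        text = re.sub(remove_regex, "", text)
    if replace_regexes:
        for pattern, replacement in replace_regexes.items():
            text = re.sub(pattern, replacement, text)
    for substring in remove_substrings or []:
        text = text.replace(substring, "")
    # 3. Whitespace handling runs last.
    if remove_empty_lines:
        text = "\n".join(line for line in text.split("\n") if line.strip())
    if remove_extra_whitespaces:
        text = re.sub(r"\s\s+", " ", text)
    if strip_whitespaces:
        text = text.strip()
    return text

print(clean_text("Page 3 of 7\nBody text", remove_regex=r"Page \d+ of \d+"))  # prints: Body text
```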
Run Method
```python
run(documents: list[Document]) -> {"documents": list[Document]}
```
Parameters:
documents (list[Document], required): The list of documents to clean.
Raises:
TypeError: If the input is not a list of Document objects.
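The guard behind this error can be sketched as follows; `ensure_document_list` is a hypothetical helper for illustration, not part of the Haystack API, and approximating a Document by its `content` attribute is a simplification.

```python
def ensure_document_list(documents):
    # Minimal sketch of run()'s input check: anything that is not a
    # list of Document-like objects is rejected with TypeError.
    if not isinstance(documents, list) or not all(
        hasattr(doc, "content") for doc in documents
    ):
        raise TypeError("DocumentCleaner expects a list of Document objects.")

ensure_document_list([])  # an empty list is accepted
try:
    ensure_document_list("not a list")
except TypeError as err:
    print(err)
```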
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | documents | list[Document] | Documents with text content to clean |
| Output | documents | list[Document] | Cleaned documents with transformed text content |
Usage Examples
Basic Cleaning
```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This is a document to clean\n\n\nsubstring to remove")
cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])
assert result["documents"][0].content == "This is a document to clean "
```
Regex-Based Cleaning
```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="Page 1 of 10\nActual content here\nPage 2 of 10")
cleaner = DocumentCleaner(
    remove_regex=r"Page \d+ of \d+",
    remove_empty_lines=True,
)
result = cleaner.run(documents=[doc])
```
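In plain `re` terms, remove_regex substitutes every match with the empty string, and replace_regexes performs the same substitution with a replacement of your choosing, applied afterwards. A stdlib equivalent of the example above (a sketch of the behavior, not the component's code):

```python
import re

text = "Page 1 of 10\nActual content here\nPage 2 of 10"
# remove_regex: every match is replaced with the empty string.
text = re.sub(r"Page \d+ of \d+", "", text)
# remove_empty_lines: drop the lines left empty by the removal.
text = "\n".join(line for line in text.split("\n") if line.strip())
print(text)  # prints: Actual content here
# replace_regexes uses the same mechanism with a chosen replacement,
# e.g. the pattern/replacement pair "Actual" -> "Real":
print(re.sub("Actual", "Real", text))  # prints: Real content here
```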
Unicode Normalization and ASCII Conversion
```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="Cafe\u0301 Resum\u00e9")
cleaner = DocumentCleaner(
    unicode_normalization="NFC",
    ascii_only=True,
)
result = cleaner.run(documents=[doc])
# Content becomes ASCII-only with accents removed
```
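The two steps can be reproduced with the standard library. Dropping non-ASCII code points after NFKD decomposition is one plausible way ascii_only could behave; that detail is an assumption, not a guarantee about the implementation.

```python
import unicodedata

# "Café Résumé" with one accent as a combining mark, one precomposed.
text = "Cafe\u0301 Resum\u00e9"
# NFC composes combining marks into single code points.
nfc = unicodedata.normalize("NFC", text)
# Assumed ascii_only strategy: NFKD decomposition, then dropping every
# non-ASCII code point.
ascii_text = unicodedata.normalize("NFKD", nfc).encode("ascii", "ignore").decode("ascii")
print(ascii_text)  # prints: Cafe Resume
```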
Removing Repeated Headers and Footers
```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

# Pages separated by form feed characters
doc = Document(
    content="Header Text\nPage 1 content\f"
    "Header Text\nPage 2 content\f"
    "Header Text\nPage 3 content"
)
cleaner = DocumentCleaner(remove_repeated_substrings=True)
result = cleaner.run(documents=[doc])
```
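A simplified stdlib sketch of what this mode does: split on the form feed page separators and drop a first line that repeats on every page. The actual component detects repeated headers and footers more generally, so this is only an approximation.

```python
# Pages separated by form feed characters, as remove_repeated_substrings requires.
text = ("Header Text\nPage 1 content\f"
        "Header Text\nPage 2 content\f"
        "Header Text\nPage 3 content")
pages = text.split("\f")
# Treat a first line that is identical on every page as a header.
first_lines = {page.split("\n", 1)[0] for page in pages}
if len(first_lines) == 1:
    pages = [page.split("\n", 1)[1] for page in pages]
cleaned = "\f".join(pages)
print(cleaned)
```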
Pipeline Integration
```python
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

pipeline = Pipeline()
pipeline.add_component("converter", TextFileToDocument())
pipeline.add_component("cleaner", DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    remove_repeated_substrings=True,
))
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.connect("converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")
```
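Conceptually, the connections above chain three stages: convert, clean, split. The dataflow can be sketched with hypothetical stand-in functions; all names below are illustrative, and word windows replace the sentence-based splitting as a simplification.

```python
import re

def convert(raw_texts):
    # Stand-in for TextFileToDocument: wrap raw text in document dicts.
    return [{"content": text} for text in raw_texts]

def clean(documents):
    # Stand-in for DocumentCleaner: collapse extra whitespace and strip.
    return [{"content": re.sub(r"\s\s+", " ", doc["content"]).strip()}
            for doc in documents]

def split(documents, split_length=5):
    # Stand-in for DocumentSplitter: fixed-size word windows.
    chunks = []
    for doc in documents:
        words = doc["content"].split()
        for i in range(0, len(words), split_length):
            chunks.append({"content": " ".join(words[i:i + split_length])})
    return chunks

chunks = split(clean(convert(["one  two   three four five six seven"])))
print([chunk["content"] for chunk in chunks])
```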
Related Pages
Implements Principle
- Deepset_ai_Haystack_Document_Cleaning - The principle behind document cleaning
Related Implementations
- Deepset_ai_Haystack_TextFileToDocument - Text file converter that produces documents for cleaning
- Deepset_ai_Haystack_PyPDFToDocument - PDF converter that produces documents for cleaning
- Deepset_ai_Haystack_DocumentSplitter - Splits cleaned documents into chunks