Principle:Deepset ai Haystack Document Cleaning
Overview
Document Cleaning is the principle of removing noise and artifacts from document text to improve the quality of downstream processing. Raw text extracted from files (especially PDFs and OCR outputs) often contains extra whitespace, empty lines, repeated headers and footers, encoding artifacts, and other noise that can degrade the performance of embedders, retrievers, and language models. Document Cleaning applies a sequence of configurable text transformations to produce cleaner, more uniform text.
Description
Text extracted from structured documents frequently contains artifacts that are not part of the meaningful content. These artifacts arise from:
- Format conversion: PDF-to-text conversion may introduce extra spaces, inconsistent line breaks, and repeated page headers/footers.
- OCR errors: Optical character recognition may produce spurious non-ASCII characters, encoding anomalies, or whitespace artifacts.
- Template content: Boilerplate text, legal disclaimers, or watermarks may be repeated across pages.
Document Cleaning addresses these issues through a configurable pipeline of text transformations applied in a specific order:
- Unicode normalization: Standardizes character representations (NFC, NFKC, NFD, NFKD) to ensure consistent text processing.
- ASCII conversion: Optionally replaces accented characters with their unaccented ASCII equivalents and drops any remaining non-ASCII content, useful for systems that require pure ASCII input.
- Extra whitespace removal: Collapses sequences of multiple whitespace characters into single spaces.
- Empty line removal: Eliminates lines that contain only whitespace.
- Substring removal: Removes specific known substrings (e.g., watermarks, boilerplate text).
- Regex-based removal: Removes text matching arbitrary regular expression patterns.
- Regex-based replacement: Replaces text matching patterns with custom strings.
- Repeated substring removal: Detects and removes text that appears identically across multiple pages, such as headers and footers. This uses a longest common n-gram heuristic across page boundaries (marked by form feed characters).
- Whitespace stripping: Optionally trims leading and trailing whitespace from the entire document.
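The ordering of the steps above can be illustrated with a minimal standard-library sketch. It is not Haystack's implementation: the function name `clean_text` and its parameter names are assumptions chosen for illustration, regex replacement, repeated-substring removal, and final stripping are omitted for brevity, and the whitespace step here collapses only spaces and tabs so that the empty-line step stays visible.

```python
import re
import unicodedata


def clean_text(
    text,
    unicode_normalization=None,  # "NFC", "NFKC", "NFD", "NFKD", or None
    ascii_only=False,
    remove_extra_whitespaces=True,
    remove_empty_lines=True,
    remove_substrings=(),
    remove_regex=None,
):
    """Hypothetical helper: apply cleaning steps in a fixed order."""
    if unicode_normalization:
        text = unicodedata.normalize(unicode_normalization, text)
    if ascii_only:
        # Decompose accented characters, then drop anything outside ASCII.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii")

    # Work page by page so form feeds (page boundaries) survive cleaning.
    cleaned_pages = []
    for page in text.split("\f"):
        if remove_extra_whitespaces:
            # Collapse runs of spaces and tabs, then trim each line.
            page = re.sub(r"[ \t]+", " ", page)
            page = "\n".join(line.strip() for line in page.split("\n"))
        if remove_empty_lines:
            page = "\n".join(line for line in page.split("\n") if line.strip())
        for substring in remove_substrings:
            page = page.replace(substring, "")
        if remove_regex:
            page = re.sub(remove_regex, "", page)
        cleaned_pages.append(page)
    return "\f".join(cleaned_pages)


raw = "Caf\u00e9   menu \n\n\n  DRAFT Prices:  10 EUR\fPage   two\n \nDRAFT end"
cleaned = clean_text(
    raw,
    ascii_only=True,
    remove_substrings=["DRAFT "],
    remove_regex=r"\s*\d+ EUR",
)
print(repr(cleaned))
```

Each step is guarded by its own flag, so any subset can be enabled without changing the relative order of the rest.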
Key Properties
- Configurable pipeline: Each cleaning step can be independently enabled or disabled.
- Order of operations: Steps are applied in a fixed, well-defined order to avoid interference between transformations.
- Page-aware processing: Several operations respect page boundaries (form feed characters), enabling correct handling of multi-page documents.
- ID handling: Documents can optionally retain their original IDs or receive new IDs after cleaning.
- Non-destructive design: New Document objects are created rather than modifying originals in place.
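The ID-handling and non-destructive properties can be sketched as follows. The `Doc` class is a hypothetical stand-in, not Haystack's `Document`, and deriving the ID from a SHA-256 hash of the content is an assumption made for this example only.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a document object."""
    content: str
    id: str = ""

    def __post_init__(self):
        if not self.id:
            # Assumption: derive an ID from the content when none is given.
            digest = hashlib.sha256(self.content.encode("utf-8")).hexdigest()
            object.__setattr__(self, "id", digest)


def clean_document(doc, keep_id=False):
    """Return a new Doc with cleaned content; the input is never mutated."""
    cleaned = " ".join(doc.content.split())
    return Doc(content=cleaned, id=doc.id if keep_id else "")


original = Doc("Hello   world \n")
same_id = clean_document(original, keep_id=True)
new_id = clean_document(original, keep_id=False)
```

Keeping the original ID is convenient when downstream records already refer to it; a fresh ID makes it explicit that the content has changed.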
Usage
Document Cleaning is used after file conversion and before document splitting in the ingestion pipeline. It ensures that the text fed into splitters, embedders, and retrievers is free of noise that could compromise semantic understanding or matching quality.
[Converter] --> [DocumentCleaner] --> [DocumentSplitter] --> [Embedder] --> [DocumentStore]
Theoretical Basis
Document Cleaning draws on principles from text normalization in NLP preprocessing. Unicode normalization forms (defined in Unicode Standard Annex #15) ensure that equivalent character sequences are represented identically. The header/footer detection algorithm uses a longest common n-gram heuristic, searching for n-grams that appear in the first or last N characters of every page. Because it relies on exact matches, it is effective for static repeated content such as copyright notices and running headers, but it will not reliably catch text that changes from page to page, such as page numbers.
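The header/footer heuristic can be sketched with word n-grams as below. This is a simplified illustration, not Haystack's code: the function names, the `n_chars`, `min_n`, and `max_n` parameters, and their defaults are assumptions, and the sketch assumes words inside a header or footer are separated by single spaces.

```python
def word_ngrams(text, n):
    """Return the set of all n-word sequences in text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def longest_common_ngram(snippets, min_n=3, max_n=30):
    """Return the longest word n-gram present in every snippet, or ""."""
    for n in range(max_n, min_n - 1, -1):
        common = set.intersection(*(word_ngrams(s, n) for s in snippets))
        if common:
            return max(common, key=len)
    return ""


def remove_repeated_header_footer(text, n_chars=300):
    """Remove exact-match headers and footers shared by every page."""
    pages = text.split("\f")
    non_empty = [p for p in pages if p.strip()]
    if len(non_empty) < 2:
        return text  # nothing to compare against
    header = longest_common_ngram([p[:n_chars] for p in non_empty])
    footer = longest_common_ngram([p[-n_chars:] for p in non_empty])
    cleaned = []
    for page in pages:
        for repeated in (header, footer):
            if repeated:
                page = page.replace(repeated, "")
        cleaned.append(page)
    return "\f".join(cleaned)


page1 = ("ACME Corp Annual Report\n"
         "Revenue grew strongly in the first quarter of the year.\n"
         "Confidential do not copy")
page2 = ("ACME Corp Annual Report\n"
         "Costs fell slightly during the second quarter overall.\n"
         "Confidential do not copy")
result = remove_repeated_header_footer(page1 + "\f" + page2, n_chars=30)
```

Restricting the search to the first and last `n_chars` characters keeps repeated phrases in the body text from being mistaken for headers or footers.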
The cleaning order is designed so that broader transformations (normalization, whitespace) happen before pattern-specific ones (regex, substring removal). Patterns and substrings are then matched against text in a canonical form, so a watermark written with a ligature or irregular spacing does not escape an exact-match removal.
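A small example shows why the order matters. The watermark text and helper names here are hypothetical; the point is only that an exact-match removal succeeds when normalization runs first and fails when it runs last.

```python
import re
import unicodedata


def normalize(text):
    # NFKC folds compatibility characters such as the "fi" ligature (U+FB01).
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces and tabs so spacing is uniform.
    return re.sub(r"[ \t]+", " ", text)


def remove_watermark(text, watermark="Confidential draft"):
    # Exact substring removal, as in the substring-removal step.
    return text.replace(watermark, "")


# The extracted text contains a ligature and irregular spacing.
raw = "Con\ufb01dential   draft\nActual content"

broad_first = remove_watermark(normalize(raw))     # watermark removed
specific_first = normalize(remove_watermark(raw))  # watermark survives
```

In the second ordering the watermark is only brought into its canonical form after the removal step has already run, so it remains in the output.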
Related Pages
- Deepset_ai_Haystack_DocumentCleaner - Implementation of Document Cleaning in Haystack
- Deepset_ai_Haystack_Text_File_Conversion - Converting text files to documents before cleaning
- Deepset_ai_Haystack_PDF_Conversion - Converting PDFs to documents before cleaning
- Deepset_ai_Haystack_Document_Splitting - Splitting cleaned documents into chunks