Workflow: deepset Haystack Document Preprocessing Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Document_Processing |
| Last Updated | 2026-02-11 20:00 GMT |
Overview
End-to-end process for routing, converting, classifying, filtering, cleaning, splitting, and embedding documents with language-aware preprocessing logic.
Description
This workflow implements an advanced document preprocessing pipeline that goes beyond basic indexing by incorporating language detection and metadata-based routing. Incoming files are routed by type to appropriate converters, then each document is classified by language using a DocumentLanguageClassifier. A MetadataRouter filters documents by language (e.g., English only), and only qualifying documents proceed through cleaning, splitting, embedding, and storage. This pattern is essential for multilingual document collections where downstream models are language-specific.
Usage
Execute this workflow when you have a multilingual document collection and need to preprocess only documents in specific languages, or when you need metadata-driven routing logic in your preprocessing pipeline. This is also the appropriate pattern when documents require language-specific processing chains or when you want to filter out unsupported languages before embedding.
Execution Steps
Step 1: Route Files by Type
Use a FileTypeRouter to classify incoming file sources by MIME type and direct each file to the appropriate converter. Files whose type matches no configured MIME type are diverted to a separate output rather than entering the pipeline.
Key considerations:
- Configure supported MIME types (e.g., text/plain)
- Files not matching any configured type are routed to unclassified output
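The routing logic can be sketched with the standard library alone. This is an illustrative stand-in, not the Haystack API: the function name `route_by_type` and the `supported` default are assumptions made for the example.

```python
from collections import defaultdict
from mimetypes import guess_type

def route_by_type(paths, supported=("text/plain",)):
    """Group file paths by guessed MIME type, mimicking FileTypeRouter.

    Paths whose guessed type is not in `supported` land under the
    "unclassified" key, matching the behavior described above.
    """
    routes = defaultdict(list)
    for path in paths:
        mime, _ = guess_type(path)
        key = mime if mime in supported else "unclassified"
        routes[key].append(path)
    return dict(routes)

routes = route_by_type(["notes.txt", "report.pdf", "photo.png"])
# Only "notes.txt" is routed to "text/plain" here; the PDF and image
# fall into "unclassified" because only text/plain is configured.
```

In the real pipeline each route key would be connected to a matching converter component.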
Step 2: Convert Files to Documents
Transform raw file content into Haystack Document objects using format-specific converters such as TextFileToDocument. Metadata from the source file is preserved in the document.
Key considerations:
- Each file format requires its own converter component
- Converter preserves source file metadata
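Conversion can be illustrated with a minimal stand-in for `haystack.Document`; the `Document` dataclass and `text_file_to_document` helper below are hypothetical names for the sketch, showing only how content and source metadata travel together.

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Document:
    # Minimal stand-in for a Haystack Document: text plus metadata.
    content: str
    meta: dict = field(default_factory=dict)

def text_file_to_document(path: Path) -> Document:
    """Read a plain-text file into a Document, preserving the source
    path as metadata, in the spirit of TextFileToDocument."""
    return Document(
        content=path.read_text(encoding="utf-8"),
        meta={"file_path": str(path)},
    )
```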
Step 3: Classify Document Language
Apply a DocumentLanguageClassifier to detect and tag each document with its language. The classifier adds a language metadata field to each document, which subsequent routing components can use for filtering.
Key considerations:
- Language detection is automatic based on document content
- Language code is stored in document metadata (e.g., "en", "de")
- Requires sufficient text content for accurate detection
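To make the tagging step concrete, here is a toy stopword-count heuristic. It is emphatically not how DocumentLanguageClassifier works internally (the real component uses a statistical language detector); it only shows the contract: each document gains a `language` metadata field, with an "unmatched" fallback.

```python
# Toy stopword sets; a real detector uses statistical models.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "von", "zu"},
}

def detect_language(text, default="unmatched"):
    """Pick the language whose stopwords overlap the text most."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

def classify(docs):
    # Tag each document dict with a "language" metadata field.
    for doc in docs:
        doc["meta"]["language"] = detect_language(doc["content"])
    return docs
```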
Step 4: Route by Language
Use a MetadataRouter with rules to filter documents based on the detected language. Only documents matching the desired language(s) proceed to downstream processing. Documents in other languages are routed to separate outputs or discarded.
Key considerations:
- Rules are filter expressions with a field, an operator, and a value (e.g., meta.language == "en")
- Multiple language routes can be configured for parallel processing chains
- Non-matching documents go to the default unmatched output
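The routing semantics can be sketched in plain Python. The `metadata_router` function and its `(field, operator, value)` rule tuples are an assumption for this example; Haystack's MetadataRouter takes filter dictionaries, but the behavior shown (named outputs plus an "unmatched" fallback) is the same idea.

```python
import operator

# Supported comparison operators for rule predicates.
OPS = {"==": operator.eq, "!=": operator.ne}

def metadata_router(documents, rules):
    """Route documents by metadata.

    `rules` maps an output name to a (field, op, value) predicate;
    documents matching no rule go to the "unmatched" output.
    """
    routes = {name: [] for name in rules}
    routes["unmatched"] = []
    for doc in documents:
        for name, (fld, op, value) in rules.items():
            if OPS[op](doc["meta"].get(fld), value):
                routes[name].append(doc)
                break
        else:
            routes["unmatched"].append(doc)
    return routes
```

Configuring several rules (e.g., one per language) yields parallel processing chains, as noted above.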
Step 5: Clean Document Content
Apply DocumentCleaner to the filtered documents to remove noise, extra whitespace, and artifacts from the text content.
Key considerations:
- Cleaning operates on the text content while preserving metadata
- Configurable cleaning operations
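A minimal sketch of the cleaning step, roughly corresponding to DocumentCleaner's whitespace and empty-line options; the `clean` helper is illustrative, and the key point is that only `content` changes while `meta` passes through untouched.

```python
import re

def clean(doc):
    """Normalize whitespace in a document's content, leaving metadata
    intact (cf. DocumentCleaner's remove_extra_whitespaces and
    remove_empty_lines options)."""
    text = re.sub(r"[ \t]+", " ", doc["content"])  # collapse space runs
    text = re.sub(r"\n{2,}", "\n", text)           # drop empty lines
    doc["content"] = text.strip()
    return doc
```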
Step 6: Split Documents into Chunks
Use DocumentSplitter to break documents into smaller chunks suitable for embedding and retrieval. The split strategy and chunk size are configurable.
Key considerations:
- split_by options: "period", "word", "sentence", "page"
- split_length controls the maximum chunk size
- Each chunk retains the parent document's metadata including language
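The word-based strategy can be sketched as below. The `split_by_word` function is a simplified stand-in for DocumentSplitter with split_by="word", showing how `split_length` and an optional overlap produce chunks that each carry a copy of the parent metadata, language tag included.

```python
def split_by_word(doc, split_length=200, split_overlap=0):
    """Chunk a document by word count; each chunk copies the parent
    document's metadata, as described above."""
    words = doc["content"].split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + split_length])
        chunks.append({"content": piece, "meta": dict(doc["meta"])})
        if start + split_length >= len(words):
            break  # last chunk reached the end of the document
    return chunks
```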
Step 7: Generate Embeddings and Store
Compute vector embeddings for each cleaned, split document chunk using a SentenceTransformersDocumentEmbedder and persist them to the document store via a DocumentWriter.
Key considerations:
- Embedding model should match the model used in the query pipeline
- All processed documents retain their language metadata for downstream filtering
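To show the final data flow without pulling in a real embedding model, the sketch below uses a deterministic toy embedding (a hashed bag of words) in place of SentenceTransformersDocumentEmbedder, and a minimal in-memory store in place of DocumentWriter plus a document store. Every name here is illustrative; only the shape of the flow (chunk in, embedding attached, document written) mirrors the real pipeline.

```python
import hashlib

def toy_embed(text, dim=8):
    """Deterministic hashed bag-of-words vector; a stand-in for a real
    sentence-transformer model, used only to show the data flow."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

class InMemoryStore:
    # Minimal stand-in for a document store fed by a writer component.
    def __init__(self):
        self.docs = []

    def write(self, documents):
        self.docs.extend(documents)
        return {"documents_written": len(documents)}

store = InMemoryStore()
chunks = [{"content": "hello world", "meta": {"language": "en"}}]
for c in chunks:
    c["embedding"] = toy_embed(c["content"])
result = store.write(chunks)
```

In a real deployment, the query pipeline must embed queries with the same model used here, per the consideration above.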