Workflow: deepset Haystack Document Preprocessing Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Document_Processing |
| Last Updated | 2026-02-11 20:00 GMT |
Overview
End-to-end process for routing, converting, classifying, filtering, cleaning, splitting, and embedding documents with language-aware preprocessing logic.
Description
This workflow implements an advanced document preprocessing pipeline that goes beyond basic indexing by incorporating language detection and metadata-based routing. Incoming files are routed by type to appropriate converters, then each document is classified by language using a DocumentLanguageClassifier. A MetadataRouter filters documents by language (e.g., English only), and only qualifying documents proceed through cleaning, splitting, embedding, and storage. This pattern is essential for multilingual document collections where downstream models are language-specific.
Usage
Execute this workflow when you have a multilingual document collection and need to preprocess only documents in specific languages, or when you need metadata-driven routing logic in your preprocessing pipeline. This is also the appropriate pattern when documents require language-specific processing chains or when you want to filter out unsupported languages before embedding.
Execution Steps
Step 1: Route Files by Type
Use a FileTypeRouter to classify incoming file sources by MIME type and direct each file to the appropriate converter. Files whose type matches no configured MIME type are diverted to a separate output rather than entering the pipeline.
Key considerations:
- Configure supported MIME types (e.g., text/plain)
- Files not matching any configured type are routed to unclassified output
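The routing logic can be sketched with the standard library alone. This is an illustrative stand-in, not the Haystack API: the function name `route_by_type` and the `supported` default are assumptions made for the example.

```python
from collections import defaultdict
from mimetypes import guess_type

def route_by_type(paths, supported=("text/plain",)):
    """Group file paths by guessed MIME type, mimicking FileTypeRouter.

    Paths whose guessed type is not in `supported` land under the
    "unclassified" key, matching the behavior described above.
    """
    routes = defaultdict(list)
    for path in paths:
        mime, _ = guess_type(path)
        key = mime if mime in supported else "unclassified"
        routes[key].append(path)
    return dict(routes)

routes = route_by_type(["notes.txt", "report.pdf", "photo.png"])
# Only "notes.txt" is routed to "text/plain" here; the PDF and image
# fall into "unclassified" because only text/plain is configured.
```

In the real pipeline each route key would be connected to a matching converter component.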
Step 2: Convert Files to Documents
Transform raw file content into Haystack Document objects using format-specific converters such as TextFileToDocument. Metadata from the source file is preserved in the document.
Key considerations:
- Each file format requires its own converter component
- Converter preserves source file metadata
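Conversion can be illustrated with a minimal stand-in for `haystack.Document`; the `Document` dataclass and `text_file_to_document` helper below are hypothetical names for the sketch, showing only how content and source metadata travel together.

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Document:
    # Minimal stand-in for a Haystack Document: text plus metadata.
    content: str
    meta: dict = field(default_factory=dict)

def text_file_to_document(path: Path) -> Document:
    """Read a plain-text file into a Document, preserving the source
    path as metadata, in the spirit of TextFileToDocument."""
    return Document(
        content=path.read_text(encoding="utf-8"),
        meta={"file_path": str(path)},
    )
```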
Step 3: Classify Document Language
Apply a DocumentLanguageClassifier to detect and tag each document with its language. The classifier adds a language metadata field to each document, which subsequent routing components can use for filtering.
Key considerations:
- Language detection is automatic based on document content
- Language code is stored in document metadata (e.g., "en", "de")
- Requires sufficient text content for accurate detection
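To make the tagging step concrete, here is a toy stopword-count heuristic. It is emphatically not how DocumentLanguageClassifier works internally (the real component uses a statistical language detector); it only shows the contract: each document gains a `language` metadata field, with an "unmatched" fallback.

```python
# Toy stopword sets; a real detector uses statistical models.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "von", "zu"},
}

def detect_language(text, default="unmatched"):
    """Pick the language whose stopwords overlap the text most."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

def classify(docs):
    # Tag each document dict with a "language" metadata field.
    for doc in docs:
        doc["meta"]["language"] = detect_language(doc["content"])
    return docs
```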
Step 4: Route by Language
Use a MetadataRouter with rules to filter documents based on the detected language. Only documents matching the desired language(s) proceed to downstream processing. Documents in other languages are routed to separate outputs or discarded.
Key considerations:
- Rules are filter expressions with a field, an operator, and a value (e.g., meta.language == "en")
- Multiple language routes can be configured for parallel processing chains
- Non-matching documents go to the default unmatched output
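The routing semantics can be sketched in plain Python. The `metadata_router` function and its `(field, operator, value)` rule tuples are an assumption for this example; Haystack's MetadataRouter takes filter dictionaries, but the behavior shown (named outputs plus an "unmatched" fallback) is the same idea.

```python
import operator

# Supported comparison operators for rule predicates.
OPS = {"==": operator.eq, "!=": operator.ne}

def metadata_router(documents, rules):
    """Route documents by metadata.

    `rules` maps an output name to a (field, op, value) predicate;
    documents matching no rule go to the "unmatched" output.
    """
    routes = {name: [] for name in rules}
    routes["unmatched"] = []
    for doc in documents:
        for name, (fld, op, value) in rules.items():
            if OPS[op](doc["meta"].get(fld), value):
                routes[name].append(doc)
                break
        else:
            routes["unmatched"].append(doc)
    return routes
```

Configuring several rules (e.g., one per language) yields parallel processing chains, as noted above.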
Step 5: Clean Document Content
Apply DocumentCleaner to the filtered documents to remove noise, extra whitespace, and artifacts from the text content.
Key considerations:
- Cleaning operates on the text content while preserving metadata
- Configurable cleaning operations
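A minimal sketch of the cleaning step, roughly corresponding to DocumentCleaner's whitespace and empty-line options; the `clean` helper is illustrative, and the key point is that only `content` changes while `meta` passes through untouched.

```python
import re

def clean(doc):
    """Normalize whitespace in a document's content, leaving metadata
    intact (cf. DocumentCleaner's remove_extra_whitespaces and
    remove_empty_lines options)."""
    text = re.sub(r"[ \t]+", " ", doc["content"])  # collapse space runs
    text = re.sub(r"\n{2,}", "\n", text)           # drop empty lines
    doc["content"] = text.strip()
    return doc
```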
Step 6: Split Documents into Chunks
Use DocumentSplitter to break documents into smaller chunks suitable for embedding and retrieval. The split strategy and chunk size are configurable.
Key considerations:
- split_by options: "period", "word", "sentence", "page"
- split_length controls the maximum chunk size
- Each chunk retains the parent document's metadata including language
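The word-based strategy can be sketched as below. The `split_by_word` function is a simplified stand-in for DocumentSplitter with split_by="word", showing how `split_length` and an optional overlap produce chunks that each carry a copy of the parent metadata, language tag included.

```python
def split_by_word(doc, split_length=200, split_overlap=0):
    """Chunk a document by word count; each chunk copies the parent
    document's metadata, as described above."""
    words = doc["content"].split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + split_length])
        chunks.append({"content": piece, "meta": dict(doc["meta"])})
        if start + split_length >= len(words):
            break  # last chunk reached the end of the document
    return chunks
```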
Step 7: Generate Embeddings and Store
Compute vector embeddings for each cleaned, split document chunk using a SentenceTransformersDocumentEmbedder and persist them to the document store via a DocumentWriter.
Key considerations:
- Embedding model should match the model used in the query pipeline
- All processed documents retain their language metadata for downstream filtering
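To show the final data flow without pulling in a real embedding model, the sketch below uses a deterministic toy embedding (a hashed bag of words) in place of SentenceTransformersDocumentEmbedder, and a minimal in-memory store in place of DocumentWriter plus a document store. Every name here is illustrative; only the shape of the flow (chunk in, embedding attached, document written) mirrors the real pipeline.

```python
import hashlib

def toy_embed(text, dim=8):
    """Deterministic hashed bag-of-words vector; a stand-in for a real
    sentence-transformer model, used only to show the data flow."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

class InMemoryStore:
    # Minimal stand-in for a document store fed by a writer component.
    def __init__(self):
        self.docs = []

    def write(self, documents):
        self.docs.extend(documents)
        return {"documents_written": len(documents)}

store = InMemoryStore()
chunks = [{"content": "hello world", "meta": {"language": "en"}}]
for c in chunks:
    c["embedding"] = toy_embed(c["content"])
result = store.write(chunks)
```

In a real deployment, the query pipeline must embed queries with the same model used here, per the consideration above.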