Workflow: deepset Haystack Document Indexing Pipeline
| Metadata | |
|---|---|
| Domains | Data_Engineering, NLP, Document_Processing |
| Last Updated | 2026-02-11 20:00 GMT |
Overview
End-to-end process for ingesting, converting, cleaning, splitting, embedding, and storing documents from multiple file formats into a searchable document store.
Description
This workflow covers the complete document indexing pipeline in Haystack. It routes incoming files by type (text, PDF, etc.) to appropriate converters, joins the converted documents, cleans the text content, splits documents into manageable chunks, generates vector embeddings for each chunk, and writes the embedded documents into a document store. This is the prerequisite pipeline that must run before any retrieval or RAG pipeline can operate.
Usage
Execute this workflow when you have raw documents (text files, PDFs, or other formats) that need to be processed and stored in a document store for subsequent retrieval. This is typically the first pipeline to run when setting up a new knowledge base or updating an existing one with new documents.
Execution Steps
Step 1: Route Files by Type
Use a FileTypeRouter to classify incoming file sources by their MIME type (e.g., text/plain, application/pdf). Each file type is routed to its corresponding converter component.
Key considerations:
- Configure MIME types for all expected file formats
- Unsupported file types are routed to an unclassified output
Step 2: Convert Files to Documents
Convert raw files into Haystack Document objects using format-specific converters: TextFileToDocument handles plain text files, while PyPDFToDocument handles PDF files. Each converter extracts text content and preserves metadata.
Key considerations:
- TextFileToDocument for .txt files
- PyPDFToDocument for .pdf files (preserves page numbers in metadata)
- Additional converters available: DocxToDocument, HTMLToDocument, etc.
Step 3: Join Converted Documents
Merge documents from multiple converter outputs into a single stream using a DocumentJoiner. This unifies the parallel conversion paths into one sequential flow.
Key considerations:
- DocumentJoiner accepts multiple input streams
- Preserves all metadata from source converters
Step 4: Clean Document Content
Apply text cleaning operations using DocumentCleaner to remove noise, extra whitespace, headers/footers, and other artifacts from the extracted text.
Key considerations:
- Configurable cleaning rules
- Preserves document metadata while cleaning content
Step 5: Split Documents into Chunks
Break documents into smaller, semantically coherent chunks using DocumentSplitter. Splitting can be by sentence, word count, page, or custom function. Overlap between chunks helps maintain context across boundaries.
Key considerations:
- split_by parameter: "period", "word", "sentence", "page"
- split_length controls chunk size
- split_overlap maintains context between adjacent chunks
Step 6: Generate Embeddings
Compute vector embeddings for each document chunk using a document embedder (e.g., SentenceTransformersDocumentEmbedder). The embeddings enable semantic search in the retrieval pipeline.
Key considerations:
- Choose an embedding model matching your retrieval pipeline's text embedder
- Model consistency between indexing and querying is critical
- Progress bar available for monitoring large batches
Step 7: Write to Document Store
Persist the embedded document chunks to the document store using DocumentWriter. Configure duplicate handling policy to control behavior when re-indexing.
Key considerations:
- DuplicatePolicy options: NONE, SKIP, OVERWRITE, or FAIL
- Returns count of documents written for verification