Workflow: deepset Haystack Document Indexing Pipeline
| Metadata | |
|---|---|
| Domains | Data_Engineering, NLP, Document_Processing |
| Last Updated | 2026-02-11 20:00 GMT |
Overview
End-to-end process for ingesting, converting, cleaning, splitting, embedding, and storing documents from multiple file formats into a searchable document store.
Description
This workflow covers the complete document indexing pipeline in Haystack. It routes incoming files by type (text, PDF, etc.) to appropriate converters, joins the converted documents, cleans the text content, splits documents into manageable chunks, generates vector embeddings for each chunk, and writes the embedded documents into a document store. This is the prerequisite pipeline that must run before any retrieval or RAG pipeline can operate.
Usage
Execute this workflow when you have raw documents (text files, PDFs, or other formats) that need to be processed and stored in a document store for subsequent retrieval. This is typically the first pipeline to run when setting up a new knowledge base or updating an existing one with new documents.
Execution Steps
Step 1: Route Files by Type
Use a FileTypeRouter to classify incoming file sources by their MIME type (e.g., text/plain, application/pdf). Each file type is routed to its corresponding converter component.
Key considerations:
- Configure MIME types for all expected file formats
- Unsupported file types are routed to an unclassified output
Step 2: Convert Files to Documents
Convert raw files into Haystack Document objects using format-specific converters: TextFileToDocument handles plain text files, while PyPDFToDocument handles PDF files. Each converter extracts text content and preserves metadata.
Key considerations:
- TextFileToDocument for .txt files
- PyPDFToDocument for .pdf files (preserves page numbers in metadata)
- Additional converters available: DocxToDocument, HTMLToDocument, etc.
Step 3: Join Converted Documents
Merge documents from multiple converter outputs into a single stream using a DocumentJoiner. This unifies the parallel conversion paths into one sequential flow.
Key considerations:
- DocumentJoiner accepts multiple input streams
- Preserves all metadata from source converters
Step 4: Clean Document Content
Apply text cleaning operations using DocumentCleaner to remove noise, extra whitespace, headers/footers, and other artifacts from the extracted text.
Key considerations:
- Configurable cleaning rules
- Preserves document metadata while cleaning content
Step 5: Split Documents into Chunks
Break documents into smaller, semantically coherent chunks using DocumentSplitter. Splitting can be by sentence, word count, page, or custom function. Overlap between chunks helps maintain context across boundaries.
Key considerations:
- split_by parameter: "period", "word", "sentence", "page"
- split_length controls chunk size
- split_overlap maintains context between adjacent chunks
Step 6: Generate Embeddings
Compute vector embeddings for each document chunk using a document embedder (e.g., SentenceTransformersDocumentEmbedder). The embeddings enable semantic search in the retrieval pipeline.
Key considerations:
- Choose an embedding model matching your retrieval pipeline's text embedder
- Model consistency between indexing and querying is critical
- Progress bar available for monitoring large batches
Step 7: Write to Document Store
Persist the embedded document chunks to the document store using DocumentWriter. Configure duplicate handling policy to control behavior when re-indexing.
Key considerations:
- DuplicatePolicy options: NONE, SKIP, OVERWRITE, or FAIL
- Returns count of documents written for verification