Principle:Deepset ai Haystack File Type Routing
Overview
File Type Routing is the principle of directing heterogeneous file inputs to appropriate processing branches based on their MIME type. In data ingestion pipelines that handle multiple file formats (PDF, plain text, HTML, audio, images), each format requires a specialized converter. File Type Routing inspects each incoming file source, determines its MIME type through extension-based lookup or metadata inspection, and dispatches it to the correct downstream component.
Description
When building ETL or document-processing pipelines, the input often consists of files in a variety of formats. A naive approach would require the user to pre-sort files by type before feeding them into the pipeline. File Type Routing automates this step by acting as a dispatcher node that classifies each file and routes it accordingly.
The core mechanism works as follows:
- MIME type detection: For file paths, the MIME type is determined from the file extension using the system MIME type database. For in-memory byte streams, the MIME type is read from the stream's metadata.
- Pattern matching: Each configured MIME type can be specified as an exact string (e.g., text/plain) or as a regular expression pattern (e.g., audio/.*). The router attempts a full match of the detected MIME type against each pattern in order.
- Classification outcomes: Files that match a pattern are placed into the corresponding output bucket. Files that do not match any pattern are placed in an unclassified bucket. Files that cannot be read (e.g., nonexistent paths) go to a failed bucket.
This routing pattern enables fan-out architectures where a single file source feeds into multiple specialized converters, each handling its own format.
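The mechanism described above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not Haystack's implementation: InMemoryStream, detect_mime_type, and route_by_mime_type are hypothetical names introduced here, and the stream's MIME type is assumed to be stored in a plain attribute.

```python
import mimetypes
import re
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional, Union


@dataclass
class InMemoryStream:
    """Hypothetical stand-in for an in-memory byte stream carrying metadata."""
    data: bytes
    mime_type: Optional[str] = None


Source = Union[str, Path, InMemoryStream]


def detect_mime_type(source: Source) -> Optional[str]:
    """Read the MIME type from stream metadata, or guess it from the extension."""
    if isinstance(source, InMemoryStream):
        return source.mime_type
    guessed, _encoding = mimetypes.guess_type(str(source))
    return guessed


def route_by_mime_type(sources: List[Source], patterns: List[str]) -> Dict[str, list]:
    """Place each source in the bucket of the first pattern that fully matches."""
    # Every pattern is treated as a regular expression; an exact string such
    # as "text/plain" is simply a regex that matches only itself.
    compiled = [(pattern, re.compile(pattern)) for pattern in patterns]
    buckets = defaultdict(list)
    for source in sources:
        # Paths that do not exist cannot be read, so they go to "failed".
        if not isinstance(source, InMemoryStream) and not Path(source).exists():
            buckets["failed"].append(source)
            continue
        mime_type = detect_mime_type(source)
        for label, regex in compiled:
            if mime_type is not None and regex.fullmatch(mime_type):
                buckets[label].append(source)
                break
        else:
            # No pattern matched, or no MIME type could be determined.
            buckets["unclassified"].append(source)
    return dict(buckets)
```

Calling route_by_mime_type with the patterns "text/plain" and "audio/.*" returns a dictionary keyed by pattern, plus "unclassified" and "failed" keys whenever those buckets are non-empty. Patterns are tried in the order given, so a more specific pattern should be listed before a broader one that would also match.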
Key Properties
- Extensibility: Custom MIME types can be registered to handle non-standard or proprietary file formats.
- Regex support: Broad categories (such as all audio or all image types) can be captured with a single pattern.
- Error handling: Missing files can either raise exceptions or be silently classified as failed, depending on configuration.
- Metadata propagation: Optional metadata can be attached to sources during routing, converting file paths to ByteStream objects as needed.
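The extensibility and regex properties can be illustrated with the Python mimetypes and re modules. The sketch below is an assumption-laden example rather than the router's own code: the MIME type application/x-hypothetical, the extension .hyp, and the helper matches_any are all invented for illustration.

```python
import mimetypes
import re
from typing import List

# A private database keeps the registration from altering process-wide state.
mime_db = mimetypes.MimeTypes()

# Hypothetical proprietary format: both the MIME type and ".hyp" are invented.
mime_db.add_type("application/x-hypothetical", ".hyp")

# Extension-based lookup now resolves the custom extension.
custom_type, _encoding = mime_db.guess_type("export.hyp")


def matches_any(mime_type: str, patterns: List[str]) -> bool:
    """Return True if any pattern fully matches the MIME type."""
    return any(re.fullmatch(pattern, mime_type) for pattern in patterns)


# One regex covers every image subtype.
is_image = matches_any("image/png", [r"image/.*"])
# The custom type can be routed like any built-in one.
is_custom = matches_any(custom_type, [r"application/x-hypothetical"])
```

Using a full match rather than a prefix search means a pattern such as image/.* cannot accidentally match a type that merely contains the substring.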
Usage
File Type Routing is used at the entry point of document ingestion pipelines where multiple file formats must be processed. The router is placed immediately after the file source and before format-specific converters. Each output connection from the router feeds into the appropriate converter component (e.g., a text converter, a PDF converter, or an HTML converter).
A typical pipeline topology looks like:
[File Sources] --> [FileTypeRouter] --text/plain--> [TextFileToDocument]
                                    --application/pdf--> [PyPDFToDocument]
                                    --unclassified--> [FallbackHandler]
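The fan-out in this topology can be sketched as a lookup from bucket label to converter. The example is a simplified stand-in that does not use Haystack: the converter functions and fan_out are hypothetical, and each "document" is just a tagged string.

```python
from typing import Callable, Dict, List


def convert_text(source: str) -> str:
    """Hypothetical text converter; returns a tagged placeholder document."""
    return f"text-document:{source}"


def convert_pdf(source: str) -> str:
    """Hypothetical PDF converter; returns a tagged placeholder document."""
    return f"pdf-document:{source}"


def handle_fallback(source: str) -> str:
    """Hypothetical fallback for sources without a dedicated converter."""
    return f"unhandled:{source}"


def fan_out(
    buckets: Dict[str, List[str]],
    converters: Dict[str, Callable[[str], str]],
    fallback: Callable[[str], str],
) -> List[str]:
    """Send each bucket to its converter, using the fallback when none is registered."""
    documents = []
    for label, sources in buckets.items():
        handler = converters.get(label, fallback)
        documents.extend(handler(source) for source in sources)
    return documents


# Buckets as a router might emit them, keyed by the matched MIME type.
routed = {
    "text/plain": ["notes.txt"],
    "application/pdf": ["report.pdf"],
    "unclassified": ["archive.xyz"],
}
converters = {"text/plain": convert_text, "application/pdf": convert_pdf}
documents = fan_out(routed, converters, handle_fallback)
```

Keeping the label-to-converter mapping in data rather than in branching logic means a new format can be supported by registering one more entry.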
Theoretical Basis
File Type Routing is grounded in the MIME type standard (RFC 2045, RFC 6838), which defines a hierarchical naming scheme for content types. The type/subtype structure (e.g., application/pdf, text/plain) provides a standardized vocabulary for classifying file content. Extension-based detection relies on the mappings between file extensions and MIME types that are maintained by operating systems and by the Python mimetypes module.
The routing pattern itself follows the Content-Based Router enterprise integration pattern, where the content (or metadata) of a message determines its routing path through a processing pipeline.
Related Pages
- Deepset_ai_Haystack_FileTypeRouter - Implementation of File Type Routing in Haystack
- Deepset_ai_Haystack_Text_File_Conversion - Principle for converting text files to documents
- Deepset_ai_Haystack_PDF_Conversion - Principle for converting PDF files to documents
- Deepset_ai_Haystack_Metadata_Based_Routing - Routing based on document metadata fields