Implementation:Deepset ai Haystack FileTypeRouter

Overview

FileTypeRouter is a Haystack component that categorizes files or byte streams by their MIME types, enabling content-based routing in document processing pipelines. It supports both exact MIME type matching and regex patterns for flexible classification of input sources.
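The two matching modes can be illustrated with the standard library alone. The sketch below is not the component's source. It assumes that every configured entry is treated as a regular expression and compared against the whole MIME type string, and the helper name `first_matching_pattern` is hypothetical.

```python
import re
from typing import Optional


def first_matching_pattern(mime_type: str, patterns: list[str]) -> Optional[str]:
    """Return the first configured pattern that matches the whole MIME type, else None."""
    for pattern in patterns:
        # Assumption: full-match semantics, so "text/plain" does not match "text/plain2".
        if re.fullmatch(pattern, mime_type):
            return pattern
    return None


patterns = [r"audio/.*", "text/plain"]
print(first_matching_pattern("audio/mpeg", patterns))  # audio/.*
print(first_matching_pattern("text/plain", patterns))  # text/plain
print(first_matching_pattern("image/png", patterns))   # None
```

Under this assumption an exact MIME type is simply a pattern without wildcards, which is why both styles can be mixed in one list.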

Code Reference

Source file: haystack/components/routers/file_type_router.py, lines 24-199

Import:

from haystack.components.routers import FileTypeRouter

Constructor

FileTypeRouter(
    mime_types: list[str],
    additional_mimetypes: dict[str, str] | None = None,
    raise_on_failure: bool = False
)

Parameters:

mime_types (list[str], required): A list of MIME types or regex patterns to classify input files or byte streams. Examples: ["text/plain", "application/pdf"] for exact matching, or [r"audio/.*", r"text/plain"] for regex matching.
additional_mimetypes (dict[str, str] | None, default None): A dictionary mapping MIME types to file extensions, used to register custom MIME types with the mimetypes module. Example: {"application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx"}.
raise_on_failure (bool, default False): If True, raises FileNotFoundError when a file path does not exist. If False, emits a warning and places the file in the failed output.
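To make the role of additional_mimetypes concrete, the following sketch registers a custom type with Python's `mimetypes` module, which is what the parameter description above refers to. The MIME type `application/x-haystack-demo` and the extension `.hsdemo` are hypothetical, chosen only because they are unknown by default. How the component performs the registration internally is an assumption here.

```python
import mimetypes

# Hypothetical custom type and extension, used for illustration only.
additional_mimetypes = {"application/x-haystack-demo": ".hsdemo"}

# Before registration the extension cannot be resolved to a MIME type.
before = mimetypes.guess_type("sample.hsdemo")[0]

# Register each MIME type -> extension pair with the global mimetypes registry.
for mime_type, extension in additional_mimetypes.items():
    mimetypes.add_type(mime_type, extension)

# After registration the same file name resolves to the custom type.
after = mimetypes.guess_type("sample.hsdemo")[0]
print(before, "->", after)  # None -> application/x-haystack-demo
```

Because the registry is global to the process, a type registered this way is visible to any other code that calls `mimetypes.guess_type`.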

Run Method

run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[ByteStream | Path]]

Parameters:

sources (list[str | Path | ByteStream], required): A list of file paths or byte streams to categorize.
meta (dict | list[dict] | None, default None): Optional metadata to attach to the sources. A single dictionary is applied to all sources; a list must match the length of sources.
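The rule for meta (one dictionary for all sources, or one dictionary per source) can be sketched as a small normalization step. The function name `expand_meta` is hypothetical and the code only illustrates the stated contract. It is not taken from the Haystack source, and the treatment of None as empty dictionaries is an assumption.

```python
from typing import Any, Union


def expand_meta(
    meta: Union[dict[str, Any], list[dict[str, Any]], None], sources_count: int
) -> list[dict[str, Any]]:
    """Return exactly one metadata dictionary per source."""
    if meta is None:
        # Assumption: no metadata is represented as one empty dict per source.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary applies to every source; copy it so sources do not share state.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The metadata list must have the same length as the sources list.")
    return meta


print(expand_meta({"origin": "upload"}, 2))  # [{'origin': 'upload'}, {'origin': 'upload'}]
```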

I/O Contract

Direction	Name	Type	Description
Input	sources	list[str | Path | ByteStream]	File paths or byte streams to classify
Input	meta	dict | list[dict] | None	Optional metadata for sources
Output	<mime_type>	list[ByteStream | Path]	Sources matching each configured MIME type pattern
Output	unclassified	list[ByteStream | Path]	Sources whose MIME type matched no pattern
Output	failed	list[ByteStream | Path]	Sources that could not be processed (e.g., missing files)
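The shape of the output dictionary can be previewed without Haystack installed. The sketch below is a simplified stand-in written against the contract in the table, not the component's implementation: `route_paths` is a hypothetical helper, it handles only file paths (no byte streams or metadata), and it assumes full-match regex semantics.

```python
import mimetypes
import re
import tempfile
from collections import defaultdict
from pathlib import Path


def route_paths(paths, patterns, raise_on_failure=False):
    """Group paths under the first matching pattern, "unclassified", or "failed"."""
    routed = defaultdict(list)
    for path in paths:
        if not path.exists():
            if raise_on_failure:
                raise FileNotFoundError(f"File not found: {path}")
            # Missing files are collected instead of aborting the whole run.
            routed["failed"].append(path)
            continue
        mime_type = mimetypes.guess_type(path.name)[0]
        key = next(
            (p for p in patterns if mime_type and re.fullmatch(p, mime_type)),
            "unclassified",
        )
        routed[key].append(path)
    return dict(routed)


with tempfile.TemporaryDirectory() as tmp:
    txt, png = Path(tmp, "readme.txt"), Path(tmp, "image.png")
    txt.write_text("hello")
    png.write_bytes(b"")
    missing = Path(tmp, "missing.pdf")  # never created, so it should end up in "failed"
    result = route_paths([txt, png, missing], ["text/plain", "application/pdf"])
    summary = {key: [p.name for p in value] for key, value in result.items()}

print(summary)
```

In this sketch a key appears only when at least one source was routed to it, so "application/pdf" is absent from the result above.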

Usage Examples

Basic Exact MIME Type Matching

from pathlib import Path

from haystack.components.routers import FileTypeRouter

# Route on exact MIME types; any other type lands in "unclassified".
router = FileTypeRouter(mime_types=["text/plain", "application/pdf"])

# The paths must exist on disk; missing files are routed to "failed" instead.
sources = [Path("readme.txt"), Path("report.pdf"), Path("image.png")]
result = router.run(sources=sources)

# result["text/plain"] contains [PosixPath('readme.txt')] (WindowsPath on Windows)
# result["application/pdf"] contains [PosixPath('report.pdf')]
# result["unclassified"] contains [PosixPath('image.png')]

Regex Pattern Matching

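from pathlib import Path

from haystack.components.routers import FileTypeRouter

# Each regex is also the name of the output that collects its matches.
router = FileTypeRouter(mime_types=[r"audio/.*", r"text/.*"])

sources = [Path("song.mp3"), Path("notes.txt"), Path("data.csv")]
result = router.run(sources=sources)

# Assuming the files exist:
# result["audio/.*"] contains song.mp3 (audio/mpeg)
# result["text/.*"] contains notes.txt (text/plain) and data.csv (text/csv)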

Pipeline Integration with Converters

from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument, TextFileToDocument
from haystack.components.routers import FileTypeRouter

pipeline = Pipeline()
pipeline.add_component("router", FileTypeRouter(mime_types=["text/plain", "application/pdf"]))
pipeline.add_component("text_converter", TextFileToDocument())
pipeline.add_component("pdf_converter", PyPDFToDocument())  # requires the pypdf package

# Each configured MIME type string is the name of a router output.
pipeline.connect("router.text/plain", "text_converter.sources")
pipeline.connect("router.application/pdf", "pdf_converter.sources")

result = pipeline.run({"router": {"sources": ["readme.txt", "report.pdf"]}})

Registering Custom MIME Types

from haystack.components.routers import FileTypeRouter

router = FileTypeRouter(
    mime_types=["application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
    additional_mimetypes={
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx"
    }
)

Related Pages

Implements Principle

Principle:Deepset_ai_Haystack_File_Type_Routing

Deepset_ai_Haystack_File_Type_Routing - The principle behind File Type Routing
Deepset_ai_Haystack_TextFileToDocument - Text file converter component
Deepset_ai_Haystack_PyPDFToDocument - PDF file converter component
Deepset_ai_Haystack_MetadataRouter - Router based on document metadata
