Implementation:Deepset ai Haystack FileTypeRouter
Overview
FileTypeRouter is a Haystack component that categorizes files or byte streams by their MIME types, enabling content-based routing in document processing pipelines. It supports both exact MIME type matching and regex patterns for flexible classification of input sources.
Code Reference
Source file: haystack/components/routers/file_type_router.py, lines 24-199
Import:
from haystack.components.routers import FileTypeRouter
Constructor
FileTypeRouter(
    mime_types: list[str],
    additional_mimetypes: dict[str, str] | None = None,
    raise_on_failure: bool = False
)
Parameters:
- mime_types (list[str], required): A list of MIME types or regex patterns used to classify input files or byte streams. Examples: ["text/plain", "application/pdf"] for exact matching, or [r"audio/.*", r"text/plain"] for regex matching.
- additional_mimetypes (dict[str, str] | None, default None): A dictionary mapping MIME types to file extensions, used to register custom MIME types with the mimetypes module. Example: {"application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx"}.
- raise_on_failure (bool, default False): If True, raises FileNotFoundError when a file path does not exist. If False, emits a warning and places the file in the failed output.
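The additional_mimetypes parameter is described as registering custom types with the standard mimetypes module. The sketch below shows what such a registration could look like using only the standard library. It is not taken from the Haystack source; the MIME type "application/x-example-notes" and the extension ".exnotes" are hypothetical names chosen for illustration.

```python
import mimetypes

# Hypothetical mapping in the same "MIME type -> extension" shape
# that additional_mimetypes expects.
additional_mimetypes = {"application/x-example-notes": ".exnotes"}

# Register each custom type with the standard library's lookup table.
for mime_type, extension in additional_mimetypes.items():
    mimetypes.add_type(mime_type, extension)

# After registration, files with that extension resolve to the custom type.
guessed, _ = mimetypes.guess_type("meeting.exnotes")
print(guessed)
```

Once a type is registered this way, a pattern listed in mime_types can match files carrying the custom extension.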
Run Method
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[ByteStream | Path]]
Parameters:
- sources (list[str | Path | ByteStream], required): A list of file paths or byte streams to categorize.
- meta (dict | list[dict] | None, default None): Optional metadata to attach to the sources. A single dictionary is applied to all sources; a list must match the length of sources.
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | sources | list[str \| Path \| ByteStream] | File paths or byte streams to classify |
| Input | meta | dict[str, Any] \| list[dict[str, Any]] \| None | Optional metadata for sources |
| Output | <mime_type> | list[ByteStream \| Path] | Sources matching each configured MIME type pattern |
| Output | unclassified | list[ByteStream \| Path] | Sources whose MIME type matched no pattern |
| Output | failed | list[ByteStream \| Path] | Sources that could not be processed (e.g., missing files) |
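The output contract above can be pictured with a standard-library sketch that sorts paths into per-pattern, unclassified, and failed buckets. It is a simplified illustration under stated assumptions (paths only, type guessed from the file extension, full-string regex matching), not the Haystack implementation; route_by_mime is a hypothetical name.

```python
import mimetypes
import re
from collections import defaultdict
from pathlib import Path


def route_by_mime(sources, patterns):
    """Group paths by the first pattern that fully matches their guessed MIME type."""
    compiled = [(pattern, re.compile(pattern)) for pattern in patterns]
    buckets = defaultdict(list)
    for source in map(Path, sources):
        if not source.exists():
            # Mirrors raise_on_failure=False: missing files go to "failed".
            buckets["failed"].append(source)
            continue
        mime, _ = mimetypes.guess_type(source.name)
        key = next(
            (pattern for pattern, regex in compiled if mime and regex.fullmatch(mime)),
            "unclassified",  # no configured pattern matched
        )
        buckets[key].append(source)
    return dict(buckets)
```

In this sketch a bucket appears in the result only when at least one source landed in it, so callers should read keys with .get() rather than assume every pattern is present.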
Usage Examples
Basic Exact MIME Type Matching
from haystack.components.routers import FileTypeRouter
from pathlib import Path
router = FileTypeRouter(mime_types=["text/plain", "application/pdf"])
sources = [Path("readme.txt"), Path("report.pdf"), Path("image.png")]
result = router.run(sources=sources)
# result["text/plain"] contains [PosixPath('readme.txt')]
# result["application/pdf"] contains [PosixPath('report.pdf')]
# result["unclassified"] contains [PosixPath('image.png')]
Regex Pattern Matching
from pathlib import Path
from haystack.components.routers import FileTypeRouter
router = FileTypeRouter(mime_types=[r"audio/.*", r"text/.*"])
sources = [Path("song.mp3"), Path("notes.txt"), Path("data.csv")]
result = router.run(sources=sources)
# result["audio/.*"] contains audio files
# result["text/.*"] contains text files
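To see why several different MIME types land under one regex key, the matching step can be sketched in isolation. The snippet assumes each pattern must match the whole MIME string and that the first matching pattern wins; both are assumptions for illustration, and first_matching_pattern is a hypothetical helper rather than part of Haystack.

```python
import re


def first_matching_pattern(mime_type, patterns):
    """Return the first pattern that fully matches mime_type, or None."""
    for pattern in patterns:
        if re.fullmatch(pattern, mime_type):
            return pattern
    return None


patterns = [r"audio/.*", r"text/.*"]
for mime_type in ["audio/mpeg", "text/plain", "text/csv", "image/png"]:
    # Print which configured pattern, if any, claims each MIME type.
    print(mime_type, "->", first_matching_pattern(mime_type, patterns))
```

Under these assumptions, both text/plain and text/csv fall under the "text/.*" key, while image/png matches nothing and would be treated as unclassified.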
Pipeline Integration with Converters
from haystack import Pipeline
from haystack.components.routers import FileTypeRouter
from haystack.components.converters import TextFileToDocument, PyPDFToDocument
pipeline = Pipeline()
pipeline.add_component("router", FileTypeRouter(mime_types=["text/plain", "application/pdf"]))
pipeline.add_component("text_converter", TextFileToDocument())
pipeline.add_component("pdf_converter", PyPDFToDocument())
pipeline.connect("router.text/plain", "text_converter.sources")
pipeline.connect("router.application/pdf", "pdf_converter.sources")
result = pipeline.run({"router": {"sources": ["readme.txt", "report.pdf"]}})
Registering Custom MIME Types
from haystack.components.routers import FileTypeRouter
router = FileTypeRouter(
    mime_types=["application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
    additional_mimetypes={
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx"
    }
)
Related Pages
Implements Principle
- Deepset_ai_Haystack_File_Type_Routing - The principle behind File Type Routing
- Deepset_ai_Haystack_TextFileToDocument - Text file converter component
- Deepset_ai_Haystack_PyPDFToDocument - PDF file converter component
- Deepset_ai_Haystack_MetadataRouter - Router based on document metadata