Principle: Deepset.ai Haystack Text File Conversion
Overview
Text File Conversion is the principle of transforming raw plain text files into structured Document objects that can be processed by downstream pipeline components. This conversion step serves as the bridge between unstructured file-based data and the Document-centric data model used throughout a Haystack pipeline.
Description
In any document processing pipeline, raw files must be converted into a canonical internal representation before they can be searched, embedded, or otherwise analyzed. For plain text files, this conversion involves:
- Reading the file content: The raw bytes are read from a file path or an in-memory byte stream.
- Decoding with the correct encoding: Text files can use various character encodings (UTF-8, Latin-1, etc.). The converter applies a configurable encoding, which can be overridden per-source via metadata.
- Creating a Document object: The decoded text becomes the content field of a Document, which also carries metadata such as the file path, encoding information, and any user-supplied metadata.
- Metadata merging: Metadata from the byte stream source (e.g., the original file path) is merged with any user-supplied metadata, giving downstream components rich context about each document's origin.
Key Properties
- Encoding flexibility: Defaults to UTF-8 but supports any encoding. Per-source encoding overrides are possible through ByteStream metadata.
- Path handling: Can store either the full file path or just the file name in document metadata, depending on privacy and portability requirements.
- Graceful error handling: Files that cannot be read or decoded are skipped with a warning rather than failing the entire pipeline.
- Metadata propagation: Both source-level metadata (from ByteStream objects) and user-supplied metadata are preserved in the output documents.
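Two of these properties, per-source encoding overrides and graceful skipping of undecodable inputs, can be sketched together. The ByteStream and Document classes below are hypothetical stand-ins for Haystack's dataclasses, simplified for illustration.

```python
import logging
from dataclasses import dataclass, field
from typing import Any

logger = logging.getLogger(__name__)


# Hypothetical stand-ins for Haystack's ByteStream and Document (illustrative only).
@dataclass
class ByteStream:
    data: bytes
    meta: dict[str, Any] = field(default_factory=dict)


@dataclass
class Document:
    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def convert_streams(streams: list[ByteStream],
                    default_encoding: str = "utf-8") -> list[Document]:
    """Decode each stream, honoring per-source encoding overrides and skipping failures."""
    documents = []
    for stream in streams:
        # Per-source override: an "encoding" key in the stream's metadata wins.
        encoding = stream.meta.get("encoding", default_encoding)
        try:
            text = stream.data.decode(encoding)
        except (UnicodeDecodeError, LookupError) as exc:
            # Graceful degradation: warn and skip rather than failing the pipeline.
            logger.warning("Could not decode stream %s: %s",
                           stream.meta.get("file_path"), exc)
            continue
        documents.append(Document(content=text, meta=dict(stream.meta)))
    return documents
```

A stream with an invalid byte sequence is logged and dropped, while the remaining streams still produce documents.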
Usage
Text File Conversion is used in the early stages of document ingestion pipelines, typically after a File Type Router has classified inputs by MIME type. The converter takes the text/plain output from the router and produces Document objects that can be fed into cleaners, splitters, embedders, or document stores.
A common pipeline flow is:
[FileTypeRouter] --text/plain--> [TextFileToDocument] --> [DocumentCleaner] --> [DocumentSplitter] --> [DocumentStore]
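The routing stage ahead of the converter can be simulated with the standard-library mimetypes module. This is a rough stand-in for a FileTypeRouter, not Haystack's component; the route_by_mime_type helper is an assumption made for illustration.

```python
import mimetypes


def route_by_mime_type(paths: list[str]) -> dict[str, list[str]]:
    """Group file paths by guessed MIME type, mimicking a file-type routing stage."""
    routes: dict[str, list[str]] = {}
    for path in paths:
        mime, _ = mimetypes.guess_type(path)
        routes.setdefault(mime or "unclassified", []).append(path)
    return routes
```

In the flow above, the entries under "text/plain" would be handed to the text converter, while PDFs would go to a PDF converter.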
Theoretical Basis
Text File Conversion implements the Normalizer pattern in data processing, where heterogeneous input formats are transformed into a uniform internal representation. The Document data model serves as the canonical message format for the pipeline, enabling component interoperability regardless of the original source format.
The encoding handling follows established practices from the Unicode standard and Python's codec system, where byte sequences are decoded according to specified character encoding schemes.
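A quick illustration of why the encoding parameter matters: the same Unicode string has different byte representations under different codecs, and decoding with the wrong one produces garbled text or an error.

```python
# The same Unicode string has different byte representations per encoding.
text = "café"
utf8_bytes = text.encode("utf-8")      # two bytes for "é"
latin1_bytes = text.encode("latin-1")  # one byte for "é"

# Round-tripping with the matching codec recovers the original string.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text

# Decoding UTF-8 bytes as Latin-1 yields mojibake, which is why the
# converter's encoding parameter (and per-source overrides) exist.
assert utf8_bytes.decode("latin-1") == "cafÃ©"
```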
Related Pages
- Deepset_ai_Haystack_TextFileToDocument - Implementation of Text File Conversion in Haystack
- Deepset_ai_Haystack_File_Type_Routing - Routing files by MIME type before conversion
- Deepset_ai_Haystack_PDF_Conversion - Converting PDF files to documents
- Deepset_ai_Haystack_Document_Cleaning - Cleaning converted documents