Implementation:Deepset ai Haystack TextFileToDocument
Overview
TextFileToDocument is a Haystack component that converts plain text files into Document objects suitable for pipeline processing. It reads text files or byte streams, decodes their content using a configurable encoding, and produces Document objects with merged metadata.
Code Reference
Source file: haystack/components/converters/txt.py, lines 17-97
Import:
from haystack.components.converters import TextFileToDocument
Constructor
TextFileToDocument(
    encoding: str = "utf-8",
    store_full_path: bool = False
)
Parameters:
- encoding (str, default "utf-8"): The character encoding used to decode text files. If a source ByteStream specifies an encoding in its metadata, that value overrides this default.
- store_full_path (bool, default False): If True, the full file path is stored in the document metadata. If False, only the file name (basename) is stored.
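The precedence described above can be sketched in plain Python. The helpers `resolve_encoding` and `stored_file_path` are hypothetical names used only for illustration and are not part of Haystack's API. The sketch assumes the per-source override sits under an `"encoding"` key in the source metadata and that the basename is what `os.path.basename` returns.

```python
import os


def resolve_encoding(source_meta: dict, default_encoding: str = "utf-8") -> str:
    """Hypothetical helper: prefer the encoding named in the source metadata."""
    # Assumption: the override is stored under the "encoding" key.
    return source_meta.get("encoding", default_encoding)


def stored_file_path(file_path: str, store_full_path: bool = False) -> str:
    """Hypothetical helper: keep the full path or reduce it to the basename."""
    return file_path if store_full_path else os.path.basename(file_path)


# A source that declares its own encoding wins over the constructor default.
print(resolve_encoding({"encoding": "latin-1"}))  # latin-1
# Without an override, the constructor default applies.
print(resolve_encoding({}))  # utf-8

print(stored_file_path("/data/docs/notes.txt"))  # notes.txt
print(stored_file_path("/data/docs/notes.txt", store_full_path=True))  # /data/docs/notes.txt
```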
Run Method
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> {"documents": list[Document]}
Parameters:
- sources (list[str | Path | ByteStream], required): A list of file paths or ByteStream objects to convert.
- meta (dict | list[dict] | None, default None): Optional metadata to attach. A single dictionary is applied to every document; a list must have the same length as sources.
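The rule for `meta` (one dictionary shared by all sources, or one dictionary per source) can be illustrated with a small stand-in. `normalize_meta` below is a hypothetical function written for this page, not Haystack's own helper, and the choice of `ValueError` for a length mismatch is an assumption.

```python
from typing import Any, Optional, Union

Meta = dict[str, Any]


def normalize_meta(meta: Optional[Union[Meta, list[Meta]]], sources_count: int) -> list[Meta]:
    """Hypothetical stand-in: expand `meta` into one dictionary per source."""
    if meta is None:
        # No metadata given: every source gets an empty dictionary.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # Assumption: a mismatched list is rejected with ValueError.
        raise ValueError("The metadata list must have one entry per source.")
    return [dict(entry) for entry in meta]


print(normalize_meta({"author": "Alice"}, 2))
# [{'author': 'Alice'}, {'author': 'Alice'}]
print(normalize_meta([{"author": "Alice"}, {"author": "Bob"}], 2))
# [{'author': 'Alice'}, {'author': 'Bob'}]
```

Copying each dictionary keeps later per-document changes from leaking into the other documents or into the caller's input.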
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | sources | list[str \| Path \| ByteStream] | Text file paths or byte streams to convert |
| Input | meta | dict \| list[dict] \| None | Optional metadata to attach to documents |
| Output | documents | list[Document] | Converted Document objects with text content and metadata |
Usage Examples
Basic Text File Conversion
from haystack.components.converters import TextFileToDocument
converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'
Custom Encoding
from haystack.components.converters import TextFileToDocument
converter = TextFileToDocument(encoding="latin-1")
results = converter.run(sources=["legacy_file.txt"])
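Why the encoding matters can be seen without Haystack at all. The standard-library sketch below uses made-up bytes standing in for the contents of a legacy Latin-1 file: decoding them as UTF-8 fails, while Latin-1 recovers the text.

```python
# Bytes as a Latin-1 file would store the word "café" (0xE9 is "é" in Latin-1).
raw = b"caf\xe9"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE9 starts a multi-byte UTF-8 sequence that is never completed.
    utf8_ok = False

text = raw.decode("latin-1")

print(utf8_ok)    # False
print(len(text))  # 4
```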
With Metadata
from haystack.components.converters import TextFileToDocument
converter = TextFileToDocument(store_full_path=True)
results = converter.run(
    sources=["notes.txt", "readme.txt"],
    meta=[{"author": "Alice"}, {"author": "Bob"}]
)
for doc in results["documents"]:
    print(doc.meta)
Pipeline Integration
from haystack import Pipeline
from haystack.components.routers import FileTypeRouter
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
pipeline = Pipeline()
pipeline.add_component("router", FileTypeRouter(mime_types=["text/plain"]))
pipeline.add_component("converter", TextFileToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.connect("router.text/plain", "converter.sources")
pipeline.connect("converter.documents", "cleaner.documents")
result = pipeline.run({"router": {"sources": ["document.txt"]}})
Related Pages
Implements Principle
- Deepset_ai_Haystack_Text_File_Conversion - The principle behind text file conversion
- Deepset_ai_Haystack_FileTypeRouter - Routes files by MIME type before conversion
- Deepset_ai_Haystack_PyPDFToDocument - PDF file converter component
- Deepset_ai_Haystack_DocumentCleaner - Cleans converted documents