Implementation:Deepset ai Haystack TextFileToDocument

Overview

TextFileToDocument is a Haystack component that converts plain text files into Document objects suitable for pipeline processing. It reads text files or byte streams, decodes their content using a configurable encoding, and produces Document objects whose metadata merges each source's own metadata with any metadata passed to run().

Code Reference

Source file: haystack/components/converters/txt.py, lines 17-97

Import:

from haystack.components.converters import TextFileToDocument

Constructor

TextFileToDocument(
    encoding: str = "utf-8",
    store_full_path: bool = False
)

Parameters:

encoding (str, default "utf-8"): The character encoding to use when decoding text files. If a source ByteStream specifies an encoding in its metadata, that value overrides this default.
store_full_path (bool, default False): If True, the full file path is stored in the document metadata. If False, only the file name (basename) is stored.
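
The two constructor parameters can be illustrated with a small standard-library sketch. It is not the Haystack implementation; the helper names resolve_encoding and stored_file_path are hypothetical and only mirror the rules described above: a per-source "encoding" entry overrides the default, and the path is reduced to its basename unless store_full_path is True.

```python
import os


def resolve_encoding(stream_meta: dict, default_encoding: str = "utf-8") -> str:
    # Hypothetical helper: a per-source "encoding" entry wins over the default.
    return stream_meta.get("encoding", default_encoding)


def stored_file_path(file_path: str, store_full_path: bool = False) -> str:
    # Hypothetical helper: keep the whole path only when asked,
    # otherwise reduce it to the file name (basename).
    return file_path if store_full_path else os.path.basename(file_path)


# Bytes written as latin-1 decode correctly once the source metadata says so.
raw = "café".encode("latin-1")
text = raw.decode(resolve_encoding({"encoding": "latin-1"}))

print(text)
print(stored_file_path("/data/docs/notes.txt"))
print(stored_file_path("/data/docs/notes.txt", store_full_path=True))
```

Decoding the same bytes with the "utf-8" default would raise a UnicodeDecodeError, which is why a per-source override is useful for mixed-encoding inputs.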

Run Method

run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> {"documents": list[Document]}

Parameters:

sources (list[str | Path | ByteStream], required): A list of file paths or ByteStream objects to convert.
meta (dict | list[dict] | None, default None): Optional metadata to attach. A single dictionary is added to every document; a list must contain exactly one dictionary per source, in the same order as sources.
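
The rule for meta can be sketched as a normalization step that always yields one dictionary per source. This is a standalone approximation under the assumptions stated in the parameter description, not the library's code; normalize_meta is a hypothetical name, and raising ValueError on a length mismatch is an assumption of this sketch.

```python
from typing import Any, Dict, List, Optional, Union


def normalize_meta(
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]],
    sources_count: int,
) -> List[Dict[str, Any]]:
    # No metadata: every source gets its own empty dictionary.
    if meta is None:
        return [{} for _ in range(sources_count)]
    # A single dictionary is copied once per source.
    if isinstance(meta, dict):
        return [dict(meta) for _ in range(sources_count)]
    # A list must line up one-to-one with the sources.
    if len(meta) != sources_count:
        raise ValueError(
            f"Got {len(meta)} metadata entries for {sources_count} sources."
        )
    return list(meta)


print(normalize_meta({"project": "demo"}, 2))
print(normalize_meta([{"author": "Alice"}, {"author": "Bob"}], 2))
```

Copying the single dictionary per source keeps later per-document changes from leaking between documents.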

I/O Contract

Direction	Name	Type	Description
Input	sources	list[str \| Path \| ByteStream]	Text file paths or byte streams to convert
Input	meta	dict \| list[dict] \| None	Optional metadata to attach to documents
Output	documents	list[Document]	Converted Document objects with text content and metadata
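
The contract in the table can be approximated without Haystack installed. The sketch below uses a hypothetical SketchDocument dataclass as a stand-in for Document and represents each source as a (bytes, source metadata) pair; it assumes that caller-supplied metadata takes precedence when a key appears in both dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    # Stand-in for haystack's Document: text content plus metadata.
    content: str
    meta: dict = field(default_factory=dict)


def convert(sources, metas, encoding="utf-8"):
    # One output document per (bytes, source_meta) input pair.
    documents = []
    for (data, source_meta), user_meta in zip(sources, metas):
        # Later unpacking wins, so user-supplied keys override source keys
        # (an assumption of this sketch).
        merged = {**source_meta, **user_meta}
        documents.append(SketchDocument(content=data.decode(encoding), meta=merged))
    return {"documents": documents}


result = convert(
    sources=[
        (b"first file", {"file_path": "a.txt"}),
        (b"second file", {"file_path": "b.txt"}),
    ],
    metas=[{"author": "Alice"}, {"author": "Bob"}],
)
for doc in result["documents"]:
    print(doc.content, doc.meta)
```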

Usage Examples

Basic Text File Conversion

from haystack.components.converters import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'

Custom Encoding

from haystack.components.converters import TextFileToDocument

converter = TextFileToDocument(encoding="latin-1")
results = converter.run(sources=["legacy_file.txt"])

With Metadata

from haystack.components.converters import TextFileToDocument

converter = TextFileToDocument(store_full_path=True)
results = converter.run(
    sources=["notes.txt", "readme.txt"],
    meta=[{"author": "Alice"}, {"author": "Bob"}]
)
for doc in results["documents"]:
    print(doc.meta)

Pipeline Integration

from haystack import Pipeline
from haystack.components.routers import FileTypeRouter
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner

pipeline = Pipeline()
pipeline.add_component("router", FileTypeRouter(mime_types=["text/plain"]))
pipeline.add_component("converter", TextFileToDocument())
pipeline.add_component("cleaner", DocumentCleaner())

pipeline.connect("router.text/plain", "converter.sources")
pipeline.connect("converter.documents", "cleaner.documents")

result = pipeline.run({"router": {"sources": ["document.txt"]}})

Related Pages

Implements Principle

Principle:Deepset_ai_Haystack_Text_File_Conversion

Deepset_ai_Haystack_Text_File_Conversion - The principle behind text file conversion
Deepset_ai_Haystack_FileTypeRouter - Routes files by MIME type before conversion
Deepset_ai_Haystack_PyPDFToDocument - PDF file converter component
Deepset_ai_Haystack_DocumentCleaner - Cleans converted documents
