Implementation:Deepset ai Haystack PyPDFToDocument
Overview
PyPDFToDocument is a Haystack component that converts PDF files into Document objects using the PyPDF library. It supports configurable text extraction modes (plain and layout), handles multiple input sources, and propagates metadata through the conversion process.
Code Reference
Source file: haystack/components/converters/pypdf.py, lines 51-223
Import:
from haystack.components.converters import PyPDFToDocument
Dependencies: pypdf (install via pip install pypdf)
Constructor
PyPDFToDocument(
*,
extraction_mode: str | PyPDFExtractionMode = "plain",
plain_mode_orientations: tuple = (0, 90, 180, 270),
plain_mode_space_width: float = 200.0,
layout_mode_space_vertically: bool = True,
layout_mode_scale_weight: float = 1.25,
layout_mode_strip_rotated: bool = True,
layout_mode_font_height_weight: float = 1.0,
store_full_path: bool = False
)
Parameters:
- extraction_mode (str | PyPDFExtractionMode, default "plain"): The text extraction mode. "plain" for standard extraction, "layout" for layout-preserving extraction.
- plain_mode_orientations (tuple, default (0, 90, 180, 270)): Tuple of text orientations to look for in plain mode. Ignored in layout mode.
- plain_mode_space_width (float, default 200.0): Forces a default space width if one cannot be extracted from the font. Ignored in layout mode.
- layout_mode_space_vertically (bool, default True): Whether to include blank lines inferred from y distance and font height. Ignored in plain mode.
- layout_mode_scale_weight (float, default 1.25): Multiplier for string length when calculating the weighted average character width. Ignored in plain mode.
- layout_mode_strip_rotated (bool, default True): Whether to strip rotated text in layout mode. Ignored in plain mode.
- layout_mode_font_height_weight (float, default 1.0): Multiplier for font height when calculating blank line height. Ignored in plain mode.
- store_full_path (bool, default False): If True, stores the full path in metadata. If False, stores only the file name.
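Because extraction_mode accepts either a string or an enum member, a converter has to normalize the value before using it. The following is a minimal sketch of one way to do that, not the library's code: ExtractionMode and to_mode are hypothetical names introduced here, and only the two values "plain" and "layout" come from the parameter description above.

```python
from enum import Enum


class ExtractionMode(Enum):
    """Hypothetical stand-in for PyPDFExtractionMode (illustration only)."""

    PLAIN = "plain"
    LAYOUT = "layout"


def to_mode(value):
    """Return an ExtractionMode for an enum member or its string value."""
    if isinstance(value, ExtractionMode):
        # Already an enum member: pass it through unchanged.
        return value
    try:
        # Enum lookup by value, e.g. "layout" -> ExtractionMode.LAYOUT.
        return ExtractionMode(value)
    except ValueError:
        valid = [member.value for member in ExtractionMode]
        raise ValueError(
            f"Unknown extraction mode {value!r}; expected one of {valid}"
        ) from None
```

Normalizing once in the constructor means the rest of the component can compare against enum members and an unsupported string fails early with a readable message.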
Run Method
run(
sources: list[str | Path | ByteStream],
meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> {"documents": list[Document]}
Parameters:
- sources (list[str | Path | ByteStream], required): A list of PDF file paths or ByteStream objects to convert.
- meta (dict | list[dict] | None, default None): Optional metadata to attach. A single dictionary applies to all documents; a list must match the number of sources.
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | sources | list[str \| Path \| ByteStream] | PDF file paths or byte streams to convert |
| Input | meta | dict \| list[dict] \| None | Optional metadata to attach to documents |
| Output | documents | list[Document] | Converted Document objects with extracted text and metadata |
Usage Examples
Basic PDF Conversion
from haystack.components.converters import PyPDFToDocument
converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
Layout Mode Extraction
from haystack.components.converters import PyPDFToDocument
converter = PyPDFToDocument(extraction_mode="layout")
results = converter.run(sources=["table_document.pdf"])
# Layout mode preserves spatial positioning of text
With Metadata
from datetime import datetime
from haystack.components.converters import PyPDFToDocument
converter = PyPDFToDocument()
results = converter.run(
sources=["report.pdf"],
meta={"date_added": datetime.now().isoformat()}
)
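The store_full_path option also affects the metadata that ends up on each document. The sketch below shows the effect described in the parameter list, under the assumption that the source path is stored under a "file_path" key; that key name and the resolve_file_path helper are illustrative assumptions, not confirmed details of the component.

```python
import os
from typing import Any


def resolve_file_path(meta: dict[str, Any], store_full_path: bool) -> dict[str, Any]:
    """Return a copy of `meta` with the path shortened unless the full path is wanted.

    Hypothetical helper; assumes the path lives under the "file_path" key.
    """
    resolved = dict(meta)
    if not store_full_path and "file_path" in resolved:
        # Keep only the final path component, e.g. "report.pdf".
        resolved["file_path"] = os.path.basename(resolved["file_path"])
    return resolved


meta = {"file_path": "/data/reports/report.pdf", "date_added": "2024-01-01"}
short = resolve_file_path(meta, store_full_path=False)
full = resolve_file_path(meta, store_full_path=True)
```

Here short carries only the file name while full keeps the original path; other keys such as date_added pass through untouched in both cases.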
Pipeline Integration
from haystack import Pipeline
from haystack.components.routers import FileTypeRouter
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
pipeline = Pipeline()
pipeline.add_component("router", FileTypeRouter(mime_types=["application/pdf"]))
pipeline.add_component("converter", PyPDFToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="page"))
pipeline.connect("router.application/pdf", "converter.sources")
pipeline.connect("converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")
result = pipeline.run({"router": {"sources": ["report.pdf"]}})
Related Pages
Implements Principle
- Deepset_ai_Haystack_PDF_Conversion - The principle behind PDF conversion
- Deepset_ai_Haystack_FileTypeRouter - Routes files by MIME type before conversion
- Deepset_ai_Haystack_TextFileToDocument - Text file converter component
- Deepset_ai_Haystack_DocumentCleaner - Cleans converted documents
- Deepset_ai_Haystack_DocumentSplitter - Splits converted documents into chunks