Implementation:Deepset ai Haystack PyPDFToDocument

Overview

PyPDFToDocument is a Haystack component that converts PDF files into Document objects using the pypdf library. It supports two configurable text extraction modes (plain and layout), accepts file paths and ByteStream objects as sources, and carries metadata through to the resulting documents.

Code Reference

Source file: haystack/components/converters/pypdf.py, lines 51-223

Import:

from haystack.components.converters import PyPDFToDocument

Dependencies: pypdf (install via pip install pypdf)

Constructor

PyPDFToDocument(
    *,
    extraction_mode: str | PyPDFExtractionMode = "plain",
    plain_mode_orientations: tuple = (0, 90, 180, 270),
    plain_mode_space_width: float = 200.0,
    layout_mode_space_vertically: bool = True,
    layout_mode_scale_weight: float = 1.25,
    layout_mode_strip_rotated: bool = True,
    layout_mode_font_height_weight: float = 1.0,
    store_full_path: bool = False
)

Parameters:

extraction_mode (str | PyPDFExtractionMode, default "plain"): The text extraction mode. "plain" for standard extraction, "layout" for layout-preserving extraction.
plain_mode_orientations (tuple, default (0, 90, 180, 270)): Tuple of text orientations to look for in plain mode. Ignored in layout mode.
plain_mode_space_width (float, default 200.0): Forces default space width if not extracted from font. Ignored in layout mode.
layout_mode_space_vertically (bool, default True): Whether to include blank lines inferred from y distance and font height. Ignored in plain mode.
layout_mode_scale_weight (float, default 1.25): Multiplier for string length when calculating weighted average character width. Ignored in plain mode.
layout_mode_strip_rotated (bool, default True): Whether to strip rotated text in layout mode. Ignored in plain mode.
layout_mode_font_height_weight (float, default 1.0): Multiplier for font height when calculating blank line height. Ignored in plain mode.
store_full_path (bool, default False): If True, stores the full path in metadata. If False, stores only the file name.
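
The "Ignored in ... mode" notes above can be pictured as a selection of keyword arguments per extraction mode. The following is a minimal sketch, not the component's actual code: the helper name extract_text_kwargs is hypothetical, and the output keys (orientations, space_width, and the layout_mode_* names) are assumed to mirror the keyword arguments of pypdf's text extraction call.

```python
from typing import Any


def extract_text_kwargs(
    extraction_mode: str = "plain",
    *,
    plain_mode_orientations: tuple = (0, 90, 180, 270),
    plain_mode_space_width: float = 200.0,
    layout_mode_space_vertically: bool = True,
    layout_mode_scale_weight: float = 1.25,
    layout_mode_strip_rotated: bool = True,
    layout_mode_font_height_weight: float = 1.0,
) -> dict[str, Any]:
    """Hypothetical helper: keep only the options that matter for the chosen mode."""
    if extraction_mode not in ("plain", "layout"):
        raise ValueError(f"Unknown extraction mode: {extraction_mode!r}")

    kwargs: dict[str, Any] = {"extraction_mode": extraction_mode}
    if extraction_mode == "plain":
        # Plain mode: only orientation and space-width settings apply.
        kwargs["orientations"] = plain_mode_orientations
        kwargs["space_width"] = plain_mode_space_width
    else:
        # Layout mode: only the layout_mode_* settings apply.
        kwargs["layout_mode_space_vertically"] = layout_mode_space_vertically
        kwargs["layout_mode_scale_weight"] = layout_mode_scale_weight
        kwargs["layout_mode_strip_rotated"] = layout_mode_strip_rotated
        kwargs["layout_mode_font_height_weight"] = layout_mode_font_height_weight
    return kwargs
```

Calling extract_text_kwargs("layout") returns only the layout settings plus the mode itself, which makes it easy to see that changing plain_mode_space_width has no effect once layout mode is selected.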

Run Method

run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> {"documents": list[Document]}

Parameters:

sources (list[str | Path | ByteStream], required): A list of PDF file paths or ByteStream objects to convert.
meta (dict | list[dict] | None, default None): Optional metadata to attach. A single dictionary applies to all documents; a list must match the number of sources.
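
The rule for meta (one dictionary for all sources, or one dictionary per source) can be sketched in plain Python. This is an illustrative stand-in under stated assumptions, not the library's own helper: the function name normalize_meta and the exact error types are hypothetical.

```python
from typing import Any, Optional, Union


def normalize_meta(
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]],
    sources_count: int,
) -> list[dict[str, Any]]:
    """Hypothetical helper: expand meta into one dictionary per source."""
    if meta is None:
        # No metadata given: every source gets an empty dictionary.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary applies to all sources; copy it per source.
        return [dict(meta) for _ in range(sources_count)]
    if isinstance(meta, list):
        # A list must line up one-to-one with the sources.
        if len(meta) != sources_count:
            raise ValueError("The length of the meta list must match the number of sources.")
        return meta
    raise TypeError("meta must be None, a dictionary, or a list of dictionaries.")
```

For example, normalize_meta({"project": "demo"}, 2) yields two separate dictionaries with the same content, while passing a one-element list for two sources raises an error.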

I/O Contract

Direction	Name	Type	Description
Input	sources	list[str | Path | ByteStream]	PDF file paths or byte streams to convert
Input	meta	dict | list[dict] | None	Optional metadata to attach to documents
Output	documents	list[Document]	Converted Document objects with extracted text and metadata
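
How a document's metadata might be assembled from the source and from the user-supplied meta, including the effect of store_full_path, can be sketched as follows. This is a rough illustration only: the helper name merge_meta is hypothetical, and the metadata key file_path is an assumption about where the source path is recorded.

```python
import os
from typing import Any


def merge_meta(
    source_meta: dict[str, Any],
    user_meta: dict[str, Any],
    store_full_path: bool = False,
) -> dict[str, Any]:
    """Hypothetical helper: combine source metadata with user metadata."""
    # User-supplied values take precedence over values coming from the source.
    merged = {**source_meta, **user_meta}
    file_path = source_meta.get("file_path")  # assumed key name
    if not store_full_path and file_path:
        # Keep only the file name, as described for store_full_path=False.
        merged["file_path"] = os.path.basename(file_path)
    return merged
```

With store_full_path=False, a source path such as "/data/reports/report.pdf" is reduced to "report.pdf" in the sketch; with store_full_path=True the path is kept unchanged.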

Usage Examples

Basic PDF Conversion

from haystack.components.converters import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"])
documents = results["documents"]
print(documents[0].content)
# Example output (depends on the PDF's contents): 'This is a text from the PDF file.'

Layout Mode Extraction

from haystack.components.converters import PyPDFToDocument

converter = PyPDFToDocument(extraction_mode="layout")
results = converter.run(sources=["table_document.pdf"])
# Layout mode preserves spatial positioning of text

With Metadata

from datetime import datetime
from haystack.components.converters import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(
    sources=["report.pdf"],
    meta={"date_added": datetime.now().isoformat()}
)

Pipeline Integration

from haystack import Pipeline
from haystack.components.routers import FileTypeRouter
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

# Build a pipeline that routes PDFs, converts them, cleans the text, and splits by page.
pipeline = Pipeline()
pipeline.add_component("router", FileTypeRouter(mime_types=["application/pdf"]))
pipeline.add_component("converter", PyPDFToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="page"))

# Only sources matched as application/pdf are passed on to the converter.
pipeline.connect("router.application/pdf", "converter.sources")
pipeline.connect("converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")

# The router is the pipeline's entry point, so the sources are passed to it.
result = pipeline.run({"router": {"sources": ["report.pdf"]}})

Related Pages

Implements Principle

Principle:Deepset_ai_Haystack_PDF_Conversion

Deepset_ai_Haystack_PDF_Conversion - The principle behind PDF conversion
Deepset_ai_Haystack_FileTypeRouter - Routes files by MIME type before conversion
Deepset_ai_Haystack_TextFileToDocument - Text file converter component
Deepset_ai_Haystack_DocumentCleaner - Cleans converted documents
Deepset_ai_Haystack_DocumentSplitter - Splits converted documents into chunks
