Implementation:Deepset ai Haystack PyPDFToDocument

Overview

PyPDFToDocument is a Haystack component that converts PDF files into Document objects using the pypdf library. It supports two text extraction modes (plain and layout), accepts file paths and ByteStream objects as sources, and attaches caller-supplied metadata to the resulting documents.

Code Reference

Source file: haystack/components/converters/pypdf.py, lines 51-223

Import:

from haystack.components.converters import PyPDFToDocument

Dependencies: pypdf (install via pip install pypdf)

Constructor

PyPDFToDocument(
    *,
    extraction_mode: str | PyPDFExtractionMode = "plain",
    plain_mode_orientations: tuple = (0, 90, 180, 270),
    plain_mode_space_width: float = 200.0,
    layout_mode_space_vertically: bool = True,
    layout_mode_scale_weight: float = 1.25,
    layout_mode_strip_rotated: bool = True,
    layout_mode_font_height_weight: float = 1.0,
    store_full_path: bool = False
)

Parameters:

  • extraction_mode (str | PyPDFExtractionMode, default "plain"): The text extraction mode. "plain" for standard extraction, "layout" for layout-preserving extraction.
  • plain_mode_orientations (tuple, default (0, 90, 180, 270)): Tuple of text orientations to look for in plain mode. Ignored in layout mode.
  • plain_mode_space_width (float, default 200.0): Default space width to use when the width cannot be extracted from the font. Ignored in layout mode.
  • layout_mode_space_vertically (bool, default True): Whether to include blank lines inferred from y distance and font height. Ignored in plain mode.
  • layout_mode_scale_weight (float, default 1.25): Multiplier for string length when calculating weighted average character width. Ignored in plain mode.
  • layout_mode_strip_rotated (bool, default True): Whether to strip rotated text in layout mode. Ignored in plain mode.
  • layout_mode_font_height_weight (float, default 1.0): Multiplier for font height when calculating blank line height. Ignored in plain mode.
  • store_full_path (bool, default False): If True, stores the full path in metadata. If False, stores only the file name.
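
The "ignored in plain mode" and "ignored in layout mode" notes above can be read as a simple filter on the constructor arguments. The sketch below is only an illustration of that rule: effective_parameters and the two name tuples are hypothetical helpers and are not part of Haystack or pypdf.

```python
from typing import Any

# Hypothetical helper, not part of Haystack: it only illustrates which
# constructor parameters matter in each extraction mode.
PLAIN_ONLY = ("plain_mode_orientations", "plain_mode_space_width")
LAYOUT_ONLY = (
    "layout_mode_space_vertically",
    "layout_mode_scale_weight",
    "layout_mode_strip_rotated",
    "layout_mode_font_height_weight",
)


def effective_parameters(extraction_mode: str, **params: Any) -> dict[str, Any]:
    """Return only the parameters that apply to the given extraction mode."""
    if extraction_mode == "plain":
        relevant = PLAIN_ONLY
    elif extraction_mode == "layout":
        relevant = LAYOUT_ONLY
    else:
        raise ValueError(f"Unknown extraction mode: {extraction_mode!r}")
    return {name: value for name, value in params.items() if name in relevant}


settings = {
    "plain_mode_space_width": 150.0,
    "layout_mode_strip_rotated": False,
}
plain_settings = effective_parameters("plain", **settings)
layout_settings = effective_parameters("layout", **settings)
print(plain_settings)   # {'plain_mode_space_width': 150.0}
print(layout_settings)  # {'layout_mode_strip_rotated': False}
```

In practice this means that changing a layout_mode_* value has no visible effect unless extraction_mode is also set to "layout".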

Run Method

run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> {"documents": list[Document]}

Parameters:

  • sources (list[str | Path | ByteStream], required): A list of PDF file paths or ByteStream objects to convert.
  • meta (dict | list[dict] | None, default None): Optional metadata to attach. A single dictionary is applied to every document; a list must have the same length as sources, with each entry applied to the document produced from the corresponding source.
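
The rule for meta (one dictionary for all documents, or one dictionary per source) can be sketched in plain Python. This is a minimal illustration under the assumption that mismatched lengths are rejected with an error; expand_meta is a hypothetical name and not Haystack's own function.

```python
from typing import Any, Optional, Union

MetaInput = Optional[Union[dict[str, Any], list[dict[str, Any]]]]


def expand_meta(meta: MetaInput, sources_count: int) -> list[dict[str, Any]]:
    """Hypothetical helper: produce one metadata dict per source."""
    if meta is None:
        # No metadata supplied: every document gets an empty dict.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied so each document has its own instance.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError(
            f"Got {len(meta)} metadata entries for {sources_count} sources."
        )
    return [dict(entry) for entry in meta]


shared = expand_meta({"project": "demo"}, sources_count=2)
per_source = expand_meta([{"lang": "en"}, {"lang": "de"}], sources_count=2)
print(shared)      # [{'project': 'demo'}, {'project': 'demo'}]
print(per_source)  # [{'lang': 'en'}, {'lang': 'de'}]
```

Copying the shared dictionary per document avoids the case where editing one document's metadata silently changes another's.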

I/O Contract

Direction  Name       Type                           Description
Input      sources    list[str | Path | ByteStream]  PDF file paths or byte streams to convert
Input      meta       dict | list[dict] | None       Optional metadata to attach to documents
Output     documents  list[Document]                 Converted Document objects with extracted text and metadata

Usage Examples

Basic PDF Conversion

from haystack.components.converters import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'

Layout Mode Extraction

from haystack.components.converters import PyPDFToDocument

converter = PyPDFToDocument(extraction_mode="layout")
results = converter.run(sources=["table_document.pdf"])
# Layout mode preserves spatial positioning of text

With Metadata

from datetime import datetime
from haystack.components.converters import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(
    sources=["report.pdf"],
    meta={"date_added": datetime.now().isoformat()}
)

Pipeline Integration

from haystack import Pipeline
from haystack.components.routers import FileTypeRouter
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

pipeline = Pipeline()
pipeline.add_component("router", FileTypeRouter(mime_types=["application/pdf"]))
pipeline.add_component("converter", PyPDFToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="page"))

pipeline.connect("router.application/pdf", "converter.sources")
pipeline.connect("converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")

result = pipeline.run({"router": {"sources": ["report.pdf"]}})
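
Splitting by page in the pipeline above depends on page boundaries surviving the conversion. The sketch below assumes that the converter joins the text of each page with a form feed character ("\f") and that split_by="page" splits on that same character; neither detail is stated on this page, so treat both as assumptions to verify against the Haystack version in use. The page strings are made-up sample data.

```python
# Made-up sample page texts standing in for extracted PDF pages.
pages = ["First page text.", "Second page text.", "Third page text."]

# Assumption: page texts are joined with a form feed into one content string.
content = "\f".join(pages)

# Assumption: a page-based splitter recovers the pages by splitting on "\f".
recovered = content.split("\f")

print(len(recovered))  # 3
print(recovered[1])    # Second page text.
```

If a cleaning step between the converter and the splitter strips form feeds, page-based splitting would have nothing to split on, so it is worth checking the cleaner's settings when documents come out as a single chunk.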
