Principle:Deepset ai Haystack PDF Conversion

From Leeroopedia

Overview

PDF Conversion is the principle of extracting textual content from Portable Document Format (PDF) files and transforming it into structured Document objects for downstream processing. PDFs are one of the most common document formats in enterprise and academic settings, yet their internal structure is optimized for visual rendering rather than text extraction. PDF Conversion bridges this gap by parsing the PDF structure and reconstructing readable text.

Description

PDF files store content as a set of drawing instructions rather than as structured text. Extracting text means interpreting these instructions to reconstruct the original character sequences, their positions, and their reading order. The conversion process has several steps:

  • Byte stream reading: The PDF content is read from a file path or in-memory byte stream into a binary buffer.
  • PDF parsing: A PDF reader library parses the binary structure, identifying pages, fonts, text objects, and their spatial coordinates.
  • Text extraction: For each page, text is extracted according to a configured extraction mode:
    • Plain mode: Extracts text by processing character objects with configurable orientation support and space width handling.
    • Layout mode: An experimental mode that attempts to preserve the visual layout of the PDF in the extracted text, respecting vertical spacing and character positioning.
  • Page concatenation: Text from individual pages is joined using form feed characters (\f), preserving page boundary information for downstream components like document splitters.
  • Document creation: The extracted text is wrapped in a Document object with metadata including the source file path and any user-supplied metadata.
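The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Haystack implementation: it assumes the pypdf library as the PDF reader, and SimpleDocument, extract_page_texts, build_document, and convert_pdf are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from io import BytesIO
from pathlib import Path
from typing import Any, Dict, List, Optional

FORM_FEED = "\f"  # page separator described in the steps above


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document object: text plus metadata."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def extract_page_texts(pdf_bytes: bytes, extraction_mode: str = "plain") -> List[str]:
    """Parse an in-memory PDF and return one string per page."""
    # Imported lazily so the pure-Python helpers below work without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(BytesIO(pdf_bytes))
    # extraction_mode is "plain" or "layout" (assumes a pypdf release that supports it).
    return [page.extract_text(extraction_mode=extraction_mode) or "" for page in reader.pages]


def build_document(
    page_texts: List[str], file_path: str, user_meta: Optional[Dict[str, Any]] = None
) -> SimpleDocument:
    """Join page texts with form feeds and attach source and user metadata."""
    meta = {"file_path": file_path, **(user_meta or {})}
    return SimpleDocument(content=FORM_FEED.join(page_texts), meta=meta)


def convert_pdf(path: str, user_meta: Optional[Dict[str, Any]] = None,
                extraction_mode: str = "plain") -> SimpleDocument:
    """Byte stream reading -> parsing and extraction -> concatenation -> document."""
    pdf_bytes = Path(path).read_bytes()
    pages = extract_page_texts(pdf_bytes, extraction_mode)
    return build_document(pages, str(path), user_meta)
```

Keeping the concatenation and metadata step separate from the parsing step makes it easy to exercise without a real PDF, for example build_document(["page one", "page two"], "report.pdf").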

Key Properties

  • Extraction mode flexibility: Supports both plain text extraction and layout-preserving extraction to accommodate different use cases.
  • Page boundary preservation: Form feed characters between pages enable downstream page-aware splitting.
  • Graceful degradation: Files that cannot be parsed or that yield empty text produce warnings rather than pipeline failures.
  • Metadata propagation: File path information and user-supplied metadata are merged into the output Document.
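Two of these properties lend themselves to a short sketch: recovering page numbers from the form feed separators, and continuing past files that fail. The helper names split_pages and convert_all are hypothetical, and keeping an empty result (rather than dropping it) is an assumption made for this example.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def split_pages(content: str) -> List[Tuple[int, str]]:
    """Recover (page_number, text) pairs from form-feed-delimited content."""
    return list(enumerate(content.split("\f"), start=1))


def convert_all(sources: Iterable[str], convert: Callable[[str], str]) -> List[str]:
    """Convert each source; warn and continue when one fails or yields no text."""
    results: List[str] = []
    for source in sources:
        try:
            text = convert(source)
        except Exception as exc:  # broad on purpose: one bad file must not stop the batch
            logger.warning("Could not convert %s, skipping it: %s", source, exc)
            continue
        if not text.strip():
            # Assumption: an empty result is kept and only flagged with a warning.
            logger.warning("%s yielded no extractable text", source)
        results.append(text)
    return results
```

A scanned PDF with no text layer is the typical source of the empty-text case; an OCR step would be needed to recover its content.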

Usage

PDF Conversion is used in document ingestion pipelines where PDF files are a primary or significant input format. It typically sits after a File Type Router that directs PDF files to this converter, and before document cleaning and splitting stages.

A common pipeline flow is:

[FileTypeRouter] --application/pdf--> [PyPDFToDocument] --> [DocumentCleaner] --> [DocumentSplitter] --> [DocumentStore]
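The flow above could be assembled in Haystack 2.x roughly as follows. This is a sketch under assumptions: the component names ("router", "converter", and so on) are illustrative, InMemoryDocumentStore with a DocumentWriter stands in for the DocumentStore stage, and the splitter settings are arbitrary.

```python
def build_pdf_indexing_pipeline():
    """Assemble router -> converter -> cleaner -> splitter -> writer."""
    # Imports live inside the function so this module loads without Haystack installed.
    from haystack import Pipeline
    from haystack.components.converters import PyPDFToDocument
    from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
    from haystack.components.routers import FileTypeRouter
    from haystack.components.writers import DocumentWriter
    from haystack.document_stores.in_memory import InMemoryDocumentStore

    store = InMemoryDocumentStore()
    pipeline = Pipeline()
    pipeline.add_component("router", FileTypeRouter(mime_types=["application/pdf"]))
    pipeline.add_component("converter", PyPDFToDocument())
    pipeline.add_component("cleaner", DocumentCleaner())
    # Splitter settings are illustrative only.
    pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))
    pipeline.add_component("writer", DocumentWriter(document_store=store))

    # The router exposes one output per configured MIME type.
    pipeline.connect("router.application/pdf", "converter.sources")
    pipeline.connect("converter.documents", "cleaner.documents")
    pipeline.connect("cleaner.documents", "splitter.documents")
    pipeline.connect("splitter.documents", "writer.documents")
    return pipeline, store
```

With Haystack installed, the pipeline would be run with something like pipeline.run({"router": {"sources": ["report.pdf"]}}), where report.pdf is a placeholder path.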

Theoretical Basis

The PDF format (ISO 32000) defines documents as collections of objects arranged in a page tree. Text is stored as sequences of character codes with associated font and positioning information. Text extraction algorithms must:

  1. Decode character codes using font encoding tables (which may use custom encodings, CIDFont mappings, or ToUnicode CMaps).
  2. Determine reading order from spatial coordinates, since PDF does not inherently define a linear text flow.
  3. Infer word boundaries from inter-character spacing, using configurable space width thresholds.
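Steps 2 and 3 can be illustrated with a toy model that assumes character decoding (step 1) has already happened. The Glyph type, the glyphs_to_text function, and the threshold values are hypothetical; real extractors handle rotation, font metrics, and multi-column pages, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; PDF y coordinates grow upward
    width: float  # advance width of the glyph


def glyphs_to_text(glyphs: List[Glyph], space_width: float = 2.0,
                   line_tolerance: float = 2.0) -> str:
    """Order glyphs top-to-bottom, left-to-right and insert spaces at wide gaps."""
    # Reading order: higher y first (top of page), then increasing x.
    ordered = sorted(glyphs, key=lambda g: (-g.y, g.x))

    # Group glyphs whose baselines are within line_tolerance into one line.
    lines: List[List[Glyph]] = []
    for glyph in ordered:
        if lines and abs(lines[-1][0].y - glyph.y) <= line_tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])

    rendered = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap >= space_width:  # wide gap: treat as a word boundary
                text += " "
            text += cur.char
        rendered.append(text)
    return "\n".join(rendered)
```

Raising space_width merges words that are set close together; lowering it risks splitting words that use wide letter spacing, which is why the threshold is exposed as a parameter.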

Layout mode adds spatial analysis to preserve the two-dimensional arrangement of text on each page, which is important for tables, multi-column layouts, and forms.
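One simple way to picture layout preservation is to map page coordinates onto a character grid. The sketch below is an illustration of the idea only, not how any particular library implements layout mode; render_layout and its char_width and line_height parameters are hypothetical.

```python
from typing import Dict, List, Tuple


def render_layout(words: List[Tuple[float, float, str]],
                  char_width: float = 6.0, line_height: float = 12.0) -> str:
    """Place (x, y, text) words on a character grid; y grows upward as in PDF."""
    if not words:
        return ""
    top = max(y for _, y, _ in words)
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for x, y, text in words:
        row = round((top - y) / line_height)  # vertical position -> output line
        col = round(x / char_width)           # horizontal position -> column
        rows.setdefault(row, []).append((col, text))

    lines = []
    for row in range(max(rows) + 1):          # missing rows become blank lines
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and col <= len(line):
                line += " "                   # overlapping words: keep one space
            else:
                line = line.ljust(col)        # pad with spaces up to the column
            line += text
        lines.append(line)
    return "\n".join(lines)
```

Because columns are padded to their horizontal position, cells of a table stay aligned across rows, and blank output lines reflect vertical gaps on the page.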
