Implementation:Unstructured IO Unstructured Partition Pdf

From Leeroopedia
Knowledge Sources
Domains Document_Processing, PDF
Last Updated 2026-02-12 00:00 GMT

Overview

Concrete tool, provided by the Unstructured library, for partitioning PDF documents into structured elements.

Description

The partition_pdf function is the format-specific partitioner for PDF documents. It supports all four strategies (auto, fast, hi_res, ocr_only) and exposes PDF-specific parameters for table structure inference, image extraction, form extraction, password handling, and fine-grained pdfminer layout tuning.

Usage

Import this function when you need PDF-specific controls that are not available through the generic partition() function, such as pdfminer margin tuning, form extraction, or password-protected PDF handling. For general use, prefer partition(), which routes to this function automatically for PDF files.
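The choice described above can be sketched as a small dispatcher. This is only an illustration under stated assumptions: `PDF_SPECIFIC_KWARGS`, `needs_partition_pdf`, and `partition_document` are hypothetical names, and the set of PDF-specific options is simply taken from the signature shown on this page.

```python
from typing import Any

# Options this page describes as PDF-specific (taken from the signature below).
PDF_SPECIFIC_KWARGS = {
    "extract_forms",
    "form_extraction_skip_tables",
    "password",
    "pdfminer_line_margin",
    "pdfminer_char_margin",
    "pdfminer_line_overlap",
    "pdfminer_word_margin",
}


def needs_partition_pdf(options: dict[str, Any]) -> bool:
    """Return True when any requested option is a PDF-specific control."""
    return any(name in PDF_SPECIFIC_KWARGS for name in options)


def partition_document(filename: str, **options: Any):
    """Hypothetical helper: call partition_pdf only when PDF-specific options are used."""
    # Imports are deferred so needs_partition_pdf() works without the library installed.
    if needs_partition_pdf(options):
        from unstructured.partition.pdf import partition_pdf

        return partition_pdf(filename=filename, **options)
    from unstructured.partition.auto import partition

    return partition(filename=filename, **options)
```

For example, `partition_document("report.pdf", strategy="fast")` would go through partition(), while adding `password="..."` would call partition_pdf directly.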

Code Reference

Source Location

  • Repository: unstructured
  • File: unstructured/partition/pdf.py
  • Lines: 130-255

Signature

def partition_pdf(
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
    include_page_breaks: bool = False,
    strategy: str = PartitionStrategy.AUTO,
    infer_table_structure: bool = False,
    ocr_languages: Optional[str] = None,
    languages: Optional[list[str]] = None,
    detect_language_per_element: bool = False,
    metadata_last_modified: Optional[str] = None,
    chunking_strategy: Optional[str] = None,
    hi_res_model_name: Optional[str] = None,
    extract_images_in_pdf: bool = False,
    extract_image_block_types: Optional[list[str]] = None,
    extract_image_block_output_dir: Optional[str] = None,
    extract_image_block_to_payload: bool = False,
    starting_page_number: int = 1,
    extract_forms: bool = False,
    form_extraction_skip_tables: bool = True,
    password: Optional[str] = None,
    pdfminer_line_margin: Optional[float] = None,
    pdfminer_char_margin: Optional[float] = None,
    pdfminer_line_overlap: Optional[float] = None,
    pdfminer_word_margin: Optional[float] = 0.185,
    **kwargs: Any,
) -> list[Element]:

Import

from unstructured.partition.pdf import partition_pdf

I/O Contract

Inputs

Name Type Required Description
filename Optional[str] No Path to PDF file
file Optional[IO[bytes]] No File-like object with PDF content
strategy str No Partition strategy (default "auto")
infer_table_structure bool No Infer table row/column structure (default False)
languages Optional[list[str]] No OCR language codes
hi_res_model_name Optional[str] No Layout detection model name
extract_forms bool No Extract PDF form fields (default False)
password Optional[str] No Password for encrypted PDFs
pdfminer_word_margin Optional[float] No Word spacing threshold (default 0.185)
chunking_strategy Optional[str] No Apply chunking inline (basic or by_title)
Outputs

Name Type Description
return list[Element] Ordered list of elements extracted from the PDF, including NarrativeText, Title, Table (with text_as_html), Image, ListItem, etc.
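Because the return value is a flat, ordered list of mixed element types, a common first step is to bucket the elements by class name. The following is a minimal sketch; `group_by_type` is a hypothetical helper that relies only on the class name of each element, so it does not import anything from the library.

```python
from collections import defaultdict
from typing import Any, Iterable


def group_by_type(elements: Iterable[Any]) -> dict[str, list[Any]]:
    """Group elements by class name, preserving document order within each group."""
    groups: dict[str, list[Any]] = defaultdict(list)
    for el in elements:
        groups[type(el).__name__].append(el)
    return dict(groups)
```

With the output of partition_pdf, `group_by_type(elements).get("Table", [])` would give the tables in document order, and `.get("Title", [])` the titles.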

Usage Examples

High-Resolution PDF with Table Extraction

from unstructured.documents.elements import Table
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="financial_report.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    languages=["eng"],
)

# Collect the Table elements and print the HTML rendering of each one
tables = [el for el in elements if isinstance(el, Table)]
for table in tables:
    print(table.metadata.text_as_html)

Password-Protected PDF with Custom Margins

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="protected.pdf",
    password="secret123",
    strategy="fast",
    pdfminer_word_margin=0.2,
    pdfminer_line_margin=0.5,
)
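When margin values come from configuration in which some entries may be unset, one option is to forward only the values that were actually provided, so the defaults in the signature apply to the rest. This is a sketch under that assumption; `pdfminer_overrides` is a hypothetical helper whose output keys mirror the pdfminer_* parameter names in the signature above.

```python
from typing import Optional


def pdfminer_overrides(
    *,
    line_margin: Optional[float] = None,
    char_margin: Optional[float] = None,
    line_overlap: Optional[float] = None,
    word_margin: Optional[float] = None,
) -> dict[str, float]:
    """Build pdfminer_* keyword arguments, skipping values that are unset."""
    candidates = {
        "pdfminer_line_margin": line_margin,
        "pdfminer_char_margin": char_margin,
        "pdfminer_line_overlap": line_overlap,
        "pdfminer_word_margin": word_margin,
    }
    # Keep only explicitly provided values; omitted keys fall back to the defaults.
    return {name: value for name, value in candidates.items() if value is not None}
```

The result can be unpacked into the call shown above, for example `partition_pdf(filename="protected.pdf", password="secret123", strategy="fast", **pdfminer_overrides(word_margin=0.2, line_margin=0.5))`.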
