Implementation:Unstructured IO Unstructured Partition Pdf

Knowledge Sources	Unstructured pdfminer.six
Domains	Document_Processing, PDF
Last Updated	2026-02-12 00:00 GMT

Overview

A concrete tool, provided by the Unstructured library, for partitioning PDF documents into structured elements.

Description

The partition_pdf function is the format-specific partitioner for PDF documents. It supports all four strategies (auto, fast, hi_res, ocr_only) and exposes PDF-specific parameters for table structure inference, image extraction, form extraction, password handling, and fine-grained pdfminer layout tuning.

Usage

Import this function when you need PDF-specific controls that the generic partition() function does not expose, such as pdfminer margin tuning, form extraction, or password-protected PDF handling. For general use, prefer partition(), which routes to this function automatically for PDF files.
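
The routing advice above can be sketched as a small wrapper. This is a minimal illustration, not part of the library: partition_document, needs_pdf_partitioner, and the PDF_ONLY_KWARGS set are hypothetical names, and treating exactly these parameters as PDF-only is an assumption based on the signature documented on this page.

```python
from typing import Any

# Parameters from the partition_pdf signature that this sketch treats as
# PDF-only. The exact membership of this set is an assumption.
PDF_ONLY_KWARGS = frozenset({
    "extract_forms",
    "form_extraction_skip_tables",
    "password",
    "pdfminer_line_margin",
    "pdfminer_char_margin",
    "pdfminer_line_overlap",
    "pdfminer_word_margin",
})


def needs_pdf_partitioner(kwargs: dict[str, Any]) -> bool:
    """Return True when at least one PDF-only parameter is requested."""
    return any(name in PDF_ONLY_KWARGS for name in kwargs)


def partition_document(filename: str, **kwargs: Any) -> list:
    """Hypothetical wrapper: use partition_pdf only when PDF-only options are set."""
    if filename.lower().endswith(".pdf") and needs_pdf_partitioner(kwargs):
        # Imported lazily so the module loads without the PDF extras installed.
        from unstructured.partition.pdf import partition_pdf
        return partition_pdf(filename=filename, **kwargs)
    from unstructured.partition.auto import partition
    return partition(filename=filename, **kwargs)
```

With this wrapper, partition_document("protected.pdf", password="secret123") would go to partition_pdf, while partition_document("notes.pdf", strategy="fast") would go through the generic partition().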

Code Reference

Source Location

Repository: unstructured
File: unstructured/partition/pdf.py
Lines: 130-255

Signature

def partition_pdf(
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
    include_page_breaks: bool = False,
    strategy: str = PartitionStrategy.AUTO,
    infer_table_structure: bool = False,
    ocr_languages: Optional[str] = None,
    languages: Optional[list[str]] = None,
    detect_language_per_element: bool = False,
    metadata_last_modified: Optional[str] = None,
    chunking_strategy: Optional[str] = None,
    hi_res_model_name: Optional[str] = None,
    extract_images_in_pdf: bool = False,
    extract_image_block_types: Optional[list[str]] = None,
    extract_image_block_output_dir: Optional[str] = None,
    extract_image_block_to_payload: bool = False,
    starting_page_number: int = 1,
    extract_forms: bool = False,
    form_extraction_skip_tables: bool = True,
    password: Optional[str] = None,
    pdfminer_line_margin: Optional[float] = None,
    pdfminer_char_margin: Optional[float] = None,
    pdfminer_line_overlap: Optional[float] = None,
    pdfminer_word_margin: Optional[float] = 0.185,
    **kwargs: Any,
) -> list[Element]:

Import

from unstructured.partition.pdf import partition_pdf

I/O Contract

Inputs

Name	Type	Required	Description
filename	Optional[str]	No	Path to PDF file
file	Optional[IO[bytes]]	No	File-like object with PDF content
strategy	str	No	Partition strategy (default "auto")
infer_table_structure	bool	No	Infer table row/column structure (default False)
languages	Optional[list[str]]	No	OCR language codes
hi_res_model_name	Optional[str]	No	Layout detection model name
extract_forms	bool	No	Extract PDF form fields (default False)
password	Optional[str]	No	Password for encrypted PDFs
pdfminer_word_margin	Optional[float]	No	Word spacing threshold (default 0.185)
chunking_strategy	Optional[str]	No	Apply chunking inline (basic or by_title)
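
Both filename and file are marked as not required, but a call still needs a PDF source. The sketch below shows one way to build that part of the call from either a path or raw bytes. The helper pdf_source_kwargs is hypothetical, and the rule that exactly one of the two should be supplied is an assumption of this sketch rather than something stated above.

```python
import io
from pathlib import Path
from typing import Any, Optional, Union


def pdf_source_kwargs(
    path: Optional[Union[str, Path]] = None,
    data: Optional[bytes] = None,
) -> dict[str, Any]:
    """Build the filename/file part of a partition_pdf call (hypothetical helper)."""
    # Assumption: supply exactly one source, never both and never neither.
    if (path is None) == (data is None):
        raise ValueError("provide exactly one of path or data")
    if path is not None:
        return {"filename": str(path)}
    # Wrap raw bytes in a binary file-like object for the `file` parameter.
    return {"file": io.BytesIO(data)}
```

A call could then look like partition_pdf(**pdf_source_kwargs(data=pdf_bytes), strategy="fast"), where pdf_bytes holds a PDF already loaded into memory.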

Outputs

Name	Type	Description
return	list[Element]	Ordered list of elements extracted from the PDF, including NarrativeText, Title, Table (with text_as_html), Image, ListItem, etc.
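
Because the return value mixes several element classes, a quick tally by class name is often a useful first look at a result. The following is a minimal sketch. count_element_types is a hypothetical helper that relies only on Python class names, so it works on any list of objects.

```python
from collections import Counter
from typing import Iterable


def count_element_types(elements: Iterable[object]) -> Counter:
    """Tally elements by class name, e.g. 'Title', 'NarrativeText', 'Table'."""
    return Counter(type(el).__name__ for el in elements)
```

Passing the list returned by partition_pdf to count_element_types gives a Counter keyed by names such as "NarrativeText" or "Table". Its most_common() method lists the dominant element types first.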

Usage Examples

High-Resolution PDF with Table Extraction

from unstructured.documents.elements import Table
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="financial_report.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    languages=["eng"],
)

# Select Table elements and print the inferred HTML for each one
tables = [el for el in elements if isinstance(el, Table)]
for table in tables:
    print(table.metadata.text_as_html)
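
The HTML string in text_as_html can be turned into plain rows with the standard library alone. The sketch below assumes the markup is a simple table made of tr, th, and td tags. TableRowParser and html_table_to_rows are hypothetical names, and nested tables or merged cells are not handled.

```python
from html.parser import HTMLParser
from typing import Optional


class TableRowParser(HTMLParser):
    """Collect the text of <td>/<th> cells, one list per <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._row: Optional[list[str]] = None   # cells of the row being read
        self._cell: Optional[list[str]] = None  # text chunks of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html: str) -> list[list[str]]:
    """Parse a simple HTML table into a list of rows of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

For a table element, html_table_to_rows(table.metadata.text_as_html or "") yields a list of rows. The `or ""` guards against the field being unset.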

Password-Protected PDF with Custom Margins

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="protected.pdf",
    password="secret123",
    strategy="fast",
    pdfminer_word_margin=0.2,
    pdfminer_line_margin=0.5,
)
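
When tuning pdfminer_word_margin, it can help to compare a few values side by side. The sketch below is one hedged way to do that: sweep_word_margin is a hypothetical helper that takes the partition function as an argument, and the whitespace-separated word count is only a crude proxy for how words were split or merged. It assumes str(element) returns the element's text.

```python
from typing import Any, Callable, Iterable


def sweep_word_margin(
    partition_fn: Callable[..., Iterable[Any]],
    filename: str,
    margins: Iterable[float],
    **kwargs: Any,
) -> dict[float, int]:
    """Return the whitespace-separated word count produced at each margin."""
    word_counts: dict[float, int] = {}
    for margin in margins:
        elements = partition_fn(
            filename=filename,
            strategy="fast",
            pdfminer_word_margin=margin,
            **kwargs,  # e.g. password="..." for an encrypted file
        )
        # Assumption: str(el) yields the element text.
        word_counts[margin] = sum(len(str(el).split()) for el in elements)
    return word_counts
```

A call such as sweep_word_margin(partition_pdf, "protected.pdf", [0.1, 0.185, 0.3], password="secret123") would return one word count per margin. A sharp jump between neighboring values suggests that words are being split or merged differently.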

Related Pages

Implements Principle

Principle:Unstructured_IO_Unstructured_PDF_Partitioning

Requires Environment

Environment:Unstructured_IO_Unstructured_PDF_Dependencies
