Implementation:Unstructured IO Unstructured Partition Pdf
| Knowledge Sources | |
|---|---|
| Domains | Document_Processing, PDF |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Concrete tool, provided by the Unstructured library, for partitioning PDF documents into structured elements.
Description
The partition_pdf function is the format-specific partitioner for PDF documents. It supports all four strategies (auto, fast, hi_res, ocr_only) and exposes PDF-specific parameters for table structure inference, image extraction, form extraction, password handling, and fine-grained pdfminer layout tuning.
Usage
Import this function when you need PDF-specific controls that are not available through the generic partition() function, such as pdfminer margin tuning, form extraction, or password-protected PDF handling. For general use, prefer partition(), which routes to this function automatically for PDF files.
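The choice described above can be sketched as a small dispatcher. This is an illustration only: `PDF_ONLY_OPTIONS`, `needs_partition_pdf`, and `partition_document` are hypothetical names introduced here, and the set of "PDF-only" options is simply the subset of this page's signature that the paragraph above singles out, not a list taken from the library.

```python
from typing import Any

# Assumption: these keyword arguments (from the signature on this page) are
# treated as the PDF-specific controls that justify calling partition_pdf directly.
PDF_ONLY_OPTIONS = frozenset({
    "extract_forms",
    "form_extraction_skip_tables",
    "password",
    "pdfminer_line_margin",
    "pdfminer_char_margin",
    "pdfminer_line_overlap",
    "pdfminer_word_margin",
})


def needs_partition_pdf(options: dict[str, Any]) -> bool:
    """Return True if any requested option is in the PDF-only set above."""
    return any(name in PDF_ONLY_OPTIONS for name in options)


def partition_document(path: str, **options: Any):
    """Hypothetical wrapper: pick the entry point, then partition the file."""
    if needs_partition_pdf(options):
        # Imported lazily so the helper above can be used without the library.
        from unstructured.partition.pdf import partition_pdf
        return partition_pdf(filename=path, **options)
    from unstructured.partition.auto import partition
    return partition(filename=path, **options)
```

With this sketch, `partition_document("report.pdf", strategy="fast")` would go through the generic entry point, while adding `password="..."` would switch to `partition_pdf`.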
Code Reference
Source Location
- Repository: unstructured
- File: unstructured/partition/pdf.py
- Lines: 130-255
Signature
def partition_pdf(
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
    include_page_breaks: bool = False,
    strategy: str = PartitionStrategy.AUTO,
    infer_table_structure: bool = False,
    ocr_languages: Optional[str] = None,
    languages: Optional[list[str]] = None,
    detect_language_per_element: bool = False,
    metadata_last_modified: Optional[str] = None,
    chunking_strategy: Optional[str] = None,
    hi_res_model_name: Optional[str] = None,
    extract_images_in_pdf: bool = False,
    extract_image_block_types: Optional[list[str]] = None,
    extract_image_block_output_dir: Optional[str] = None,
    extract_image_block_to_payload: bool = False,
    starting_page_number: int = 1,
    extract_forms: bool = False,
    form_extraction_skip_tables: bool = True,
    password: Optional[str] = None,
    pdfminer_line_margin: Optional[float] = None,
    pdfminer_char_margin: Optional[float] = None,
    pdfminer_line_overlap: Optional[float] = None,
    pdfminer_word_margin: Optional[float] = 0.185,
    **kwargs: Any,
) -> list[Element]:
Import
from unstructured.partition.pdf import partition_pdf
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| filename | Optional[str] | No | Path to PDF file |
| file | Optional[IO[bytes]] | No | File-like object with PDF content |
| strategy | str | No | Partition strategy: auto, fast, hi_res, or ocr_only (default "auto") |
| infer_table_structure | bool | No | Infer table row/column structure (default False) |
| languages | Optional[list[str]] | No | OCR language codes |
| hi_res_model_name | Optional[str] | No | Layout detection model name |
| extract_forms | bool | No | Extract PDF form fields (default False) |
| password | Optional[str] | No | Password for encrypted PDFs |
| pdfminer_word_margin | Optional[float] | No | Word spacing threshold (default 0.185) |
| chunking_strategy | Optional[str] | No | Apply chunking inline (basic or by_title) |
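The input contract above can be checked before a potentially expensive call. The sketch below is a hypothetical pre-flight helper, not the library's own validation; `check_pdf_inputs` and `VALID_STRATEGIES` are names introduced here, and the rule that exactly one of `filename` or `file` must be given is an assumption, since the table marks both as optional.

```python
from typing import IO, Optional

# The four strategy names listed in the Description section of this page.
VALID_STRATEGIES = ("auto", "fast", "hi_res", "ocr_only")


def check_pdf_inputs(
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
    strategy: str = "auto",
) -> None:
    """Raise ValueError if the arguments do not fit the contract sketched above."""
    # Assumption: the PDF source must come from exactly one of the two inputs.
    if (filename is None) == (file is None):
        raise ValueError("provide exactly one of filename or file")
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"unknown strategy {strategy!r}; expected one of {VALID_STRATEGIES}"
        )
```

Calling `check_pdf_inputs(filename="report.pdf", strategy="hi_res")` returns silently, while omitting both sources or passing an unknown strategy raises `ValueError`.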
Outputs
| Name | Type | Description |
|---|---|---|
| return | list[Element] | Ordered list of elements extracted from the PDF, including NarrativeText, Title, Table (with text_as_html), Image, ListItem, etc. |
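Because the return value is a flat, ordered list of typed elements, a quick tally by class name is often a useful first inspection step. The helper below is a minimal sketch; `count_by_type` is a hypothetical name, and it relies only on each element's Python class name, so it makes no assumption about the library's internals.

```python
from collections import Counter
from typing import Any, Iterable


def count_by_type(elements: Iterable[Any]) -> Counter:
    """Tally elements by class name, e.g. 'Title', 'NarrativeText', 'Table'."""
    return Counter(type(el).__name__ for el in elements)
```

For example, `count_by_type(partition_pdf(filename="report.pdf")).most_common()` would list the element types found in a (hypothetical) `report.pdf`, most frequent first.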
Usage Examples
High-Resolution PDF with Table Extraction
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="financial_report.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    languages=["eng"],
)

# Access table HTML
tables = [el for el in elements if type(el).__name__ == "Table"]
for table in tables:
    print(table.metadata.text_as_html)
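When the HTML is needed for further processing rather than printing, it can be gathered into a list. The sketch below is an assumption-laden helper: `collect_table_html` is a hypothetical name, it matches on the class name so that it can be exercised with stand-in objects, and it assumes `text_as_html` may be missing or None for some tables, in which case the table is skipped.

```python
from typing import Any, Iterable


def collect_table_html(elements: Iterable[Any]) -> list[str]:
    """Return the text_as_html value of every Table element that has one."""
    html_fragments = []
    for el in elements:
        if type(el).__name__ != "Table":
            continue
        # Tolerate elements without metadata or without an HTML rendering.
        metadata = getattr(el, "metadata", None)
        html = getattr(metadata, "text_as_html", None)
        if html:
            html_fragments.append(html)
    return html_fragments
```

The order of the returned fragments follows the order of the tables in the element list.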
Password-Protected PDF with Custom Margins
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="protected.pdf",
    password="secret123",
    strategy="fast",
    pdfminer_word_margin=0.2,
    pdfminer_line_margin=0.5,
)