Environment:Unstructured IO Unstructured PDF Dependencies

Knowledge Sources	unstructured
Domains	PDF Processing
Last Updated	2026-02-12 09:00 GMT

Overview

The PDF_Dependencies environment provides all system-level and Python-level dependencies required for partitioning PDF documents, including OCR engines, layout detection models, and rendering utilities.

Description

PDF processing in unstructured is one of the most dependency-intensive operations, requiring a combination of Python packages, system libraries, and optionally machine learning models. The pdf and image extras in pyproject.toml install the necessary Python packages, but several system-level dependencies (Tesseract, Poppler, libmagic) must also be present.

The codebase includes several notable runtime adjustments: the PIL pixel limit is raised to 5e8 (pdf.py:114) to handle large-resolution PDF page renders, and monkey-patches are applied to fix a pdfminer PSParser bug (pdf.py:105-108) and an ICC ColorSpace issue (pdf.py:1043-1053). These patches ensure robust handling of real-world PDF files that may trigger edge cases in upstream libraries.

Behavior is extensively configurable through environment variables, including model selection (UNSTRUCTURED_HI_RES_MODEL_NAME), rendering quality (PDF_RENDER_DPI, default 350), annotation handling (PDF_ANNOTATION_THRESHOLD, default 0.9), OCR engine selection (OCR_AGENT, default tesseract), and thread limiting (OMP_THREAD_LIMIT, auto-set to 1). Additional variables prefixed with ANALYSIS_* and TESSERACT_* control fine-grained analysis and OCR behavior via config.py.

Usage

This environment is required whenever the Partition_Pdf implementation is invoked, whether directly via partition_pdf() or indirectly through partition() when it detects that the input file is a PDF.
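A direct invocation can be sketched as follows. The wrapper name `partition_pdf_file` is hypothetical and used only for illustration; `partition_pdf` is imported lazily so that a missing pdf extra surfaces as a single readable error instead of a failure at module import time. The choice of `"hi_res"` as the default strategy is an assumption for this example, not a recommendation from this page.

```python
def partition_pdf_file(path, strategy="hi_res"):
    """Partition one PDF, failing with a clear hint if the pdf extra is absent.

    Hypothetical convenience wrapper; not part of unstructured itself.
    """
    try:
        # Importing here defers the heavy PDF dependencies until they are needed.
        from unstructured.partition.pdf import partition_pdf
    except ImportError as exc:
        raise RuntimeError(
            'PDF dependencies are missing; try: pip install "unstructured[pdf]"'
        ) from exc
    # Returns the list of document elements produced by unstructured.
    return partition_pdf(filename=path, strategy=strategy)
```

Calling `partition_pdf_file("report.pdf")` (the file name is a placeholder) would exercise the full dependency chain described on this page, including Poppler rendering and OCR where the chosen strategy needs them.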

System Requirements

Category	Requirement	Notes
Python	>= 3.11, < 3.14	Required Python version range
OS	Linux (recommended), macOS, Windows	Linux provides best compatibility for all OCR and rendering tools
Tesseract OCR	tesseract-ocr >= 4.0	Required for OCR-based text extraction
Poppler	poppler-utils	Required by pdf2image for PDF-to-image rendering
libmagic	libmagic1	Required for MIME type detection
RAM	>= 4 GB recommended	Large PDFs with hi-res model can be memory-intensive

Dependencies

System Packages

tesseract-ocr -- OCR engine for extracting text from images and scanned PDFs
poppler-utils -- provides pdftoppm and pdfinfo for PDF rendering
libmagic1 -- MIME type detection
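
A quick preflight check can confirm that the command-line tools behind these packages are on PATH before any partitioning starts. The sketch below uses only the standard library; the helper name `find_missing_tools` is hypothetical and not part of unstructured, and libmagic is left out because it is a shared library rather than an executable.

```python
import shutil

# Executables provided by the system packages listed above,
# mapped to the Debian/Ubuntu package that ships them.
REQUIRED_TOOLS = {
    "tesseract": "tesseract-ocr",
    "pdftoppm": "poppler-utils",
    "pdfinfo": "poppler-utils",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return {executable: package} for every tool that is not on PATH."""
    return {exe: pkg for exe, pkg in tools.items() if which(exe) is None}


if __name__ == "__main__":
    for exe, pkg in find_missing_tools().items():
        print(f"missing {exe}: install the {pkg} package")
```

An empty result means all three executables were found; it does not verify their versions.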

Python Packages

pdfminer.six >= 20260107 -- PDF text and layout extraction
pypdf >= 6.6.2 -- PDF reading and metadata access
pdf2image >= 1.17.0 -- converts PDF pages to PIL images via Poppler
pi-heif >= 1.2.0 -- HEIF image format support
pikepdf >= 10.3.0 -- PDF repair and manipulation
unstructured-inference >= 1.2.0 -- layout detection and element classification models
unstructured-pytesseract >= 0.3.15 -- Tesseract OCR Python wrapper
google-cloud-vision >= 3.12.1 -- Google Cloud Vision API for OCR (optional alternative to Tesseract)
numpy >= 1.26.0 -- numerical operations for image and layout processing

Credentials and Environment Variables

UNSTRUCTURED_HI_RES_MODEL_NAME -- name of the hi-res layout detection model to use
PDF_RENDER_DPI -- rendering resolution in DPI (default: 350)
PDF_ANNOTATION_THRESHOLD -- threshold for annotation detection (default: 0.9)
OCR_AGENT -- OCR engine to use (default: tesseract; alternative: google_cloud_vision)
OMP_THREAD_LIMIT -- OpenMP thread limit (auto-set to 1 to prevent thread contention)
GOOGLE_APPLICATION_CREDENTIALS -- path to Google Cloud service account JSON (only if using Google Cloud Vision)
ANALYSIS_* -- various analysis configuration variables (defined in config.py)
TESSERACT_* -- Tesseract-specific configuration variables (defined in config.py)
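
These are ordinary process environment variables, so one way to apply them from Python is to write them into `os.environ` before unstructured is imported. The following is a minimal sketch under that assumption; the helper `apply_pdf_env` is hypothetical, and the values in the usage note are illustrative only. This page does not state whether each variable is read at import time or on first use, so setting them first is the conservative choice.

```python
import os


def apply_pdf_env(overrides, environ=os.environ):
    """Set PDF-related variables without clobbering values already exported.

    Returns only the entries that were actually written.
    """
    applied = {}
    for name, value in overrides.items():
        if name not in environ:
            # Environment values must be strings.
            environ[name] = str(value)
            applied[name] = str(value)
    return applied
```

For example, `apply_pdf_env({"PDF_RENDER_DPI": 200, "PDF_ANNOTATION_THRESHOLD": 0.9})` lowers the render resolution for the current process unless the user has already exported `PDF_RENDER_DPI` in the shell.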

Quick Install

# Install system dependencies (Ubuntu/Debian)
sudo apt-get update && sudo apt-get install -y \
    tesseract-ocr poppler-utils libmagic1

# Install unstructured with PDF extras
pip install "unstructured[pdf]"

Code Evidence

PIL pixel limit increase (pdf.py:114):

PIL.Image.MAX_IMAGE_PIXELS = int(5e8)

pdfminer PSParser monkey-patch (pdf.py:105-108):

# Monkey-patch pdfminer PSParser bug that causes infinite loop
# on certain malformed PDF files

ICC ColorSpace monkey-patch (pdf.py:1043-1053):

# Monkey-patch pdfminer ICC ColorSpace issue that raises
# an exception on PDFs with certain color profiles

Environment variable defaults (config.py):

PDF_RENDER_DPI = int(os.environ.get("PDF_RENDER_DPI", "350"))
PDF_ANNOTATION_THRESHOLD = float(os.environ.get("PDF_ANNOTATION_THRESHOLD", "0.9"))
OCR_AGENT = os.environ.get("OCR_AGENT", "tesseract")

Common Errors

Error Message	Cause	Solution
`TesseractNotFoundError: tesseract is not installed`	Tesseract OCR not available on system PATH	Install via `sudo apt-get install tesseract-ocr`
`PDFInfoNotInstalledError: Unable to get page count. Is poppler installed?`	poppler-utils not installed	Install via `sudo apt-get install poppler-utils`
`DecompressionBombError: Image size exceeds limit`	Rendered page exceeds the PIL pixel limit	The library sets `PIL.Image.MAX_IMAGE_PIXELS = int(5e8)` automatically; if the error still occurs, the rendered page is exceptionally large, and lowering PDF_RENDER_DPI reduces its pixel count
`ImportError: pdfminer is required`	The pdf extra is not installed	Install via `pip install "unstructured[pdf]"`
Out of memory during hi-res processing	Large PDF with high DPI rendering and layout model	Reduce PDF_RENDER_DPI, use a smaller model, or increase available RAM
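
The last two rows both come down to how many pixels a rendered page contains, which grows with the square of the DPI. The sketch below works through that arithmetic; the function names are hypothetical, and it assumes page dimensions are given in PDF points (72 per inch). In Pillow, exceeding `MAX_IMAGE_PIXELS` produces a warning, and `DecompressionBombError` is raised only above twice that value.

```python
MAX_IMAGE_PIXELS = int(5e8)  # limit this page reports the library setting on PIL


def rendered_pixels(width_pt, height_pt, dpi=350):
    """Pixel count of a page rendered at `dpi`, given its size in points."""
    width_px = round(width_pt / 72 * dpi)
    height_px = round(height_pt / 72 * dpi)
    return width_px * height_px


def max_safe_dpi(width_pt, height_pt, limit=MAX_IMAGE_PIXELS):
    """Largest integer DPI whose render stays within `limit` pixels."""
    area_sq_in = (width_pt / 72) * (height_pt / 72)
    dpi = int((limit / area_sq_in) ** 0.5)
    # Rounding of the pixel dimensions can push the estimate just over the limit.
    while dpi > 1 and rendered_pixels(width_pt, height_pt, dpi) > limit:
        dpi -= 1
    return dpi
```

A US Letter page (612 x 792 points) at 350 DPI renders to 2975 x 3850 = 11,453,750 pixels, far below the limit. A 100 x 100 inch sheet (7200 x 7200 points) at the same DPI reaches 1,225,000,000 pixels, and `max_safe_dpi(7200, 7200)` gives 223 as the highest DPI that stays under 5e8.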

Compatibility Notes

OMP_THREAD_LIMIT is automatically set to 1 to prevent thread contention when running layout detection models alongside OCR
The pdfminer PSParser monkey-patch addresses an upstream bug that can cause infinite loops on malformed PDFs
The ICC ColorSpace monkey-patch prevents exceptions on PDFs with non-standard color profiles
Google Cloud Vision can be used as an alternative OCR engine by setting OCR_AGENT=google_cloud_vision and providing valid credentials
Hi-res model selection via UNSTRUCTURED_HI_RES_MODEL_NAME affects both accuracy and processing speed

Related Pages

Implementation:Unstructured_IO_Unstructured_Partition_Pdf
