Environment:Unstructured IO Unstructured PDF Dependencies

From Leeroopedia
Knowledge Sources
Domains PDF Processing
Last Updated 2026-02-12 09:00 GMT

Overview

The PDF_Dependencies environment provides all system-level and Python-level dependencies required for partitioning PDF documents, including OCR engines, layout detection models, and rendering utilities.

Description

PDF processing is among the most dependency-intensive operations in unstructured: it needs Python packages, system libraries, and optionally machine learning models. The pdf and image extras in pyproject.toml install the necessary Python packages, but several system-level dependencies (Tesseract, Poppler, libmagic) must be installed separately.

The codebase makes several notable runtime adjustments. It raises the PIL pixel limit to 5e8 (pdf.py:114) so that high-resolution PDF page renders can be opened, and it applies monkey-patches for a pdfminer PSParser bug (pdf.py:105-108) and an ICC ColorSpace issue (pdf.py:1043-1053). These patches keep real-world PDF files that trigger edge cases in upstream libraries from failing.
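To put the 5e8 pixel limit in perspective, the sketch below computes how many pixels a page produces at a given DPI and compares that with the limit. The helper names are hypothetical and the arithmetic assumes the page is rendered at exactly its physical size times the DPI; Pillow itself warns above MAX_IMAGE_PIXELS and raises DecompressionBombError above twice that value.

```python
# Limit value as reported above for pdf.py; copied here for illustration only.
MAX_IMAGE_PIXELS = int(5e8)


def rendered_pixels(width_in, height_in, dpi=350):
    """Pixel count of a page rendered at `dpi` (dimensions in inches)."""
    return round(width_in * dpi) * round(height_in * dpi)


def exceeds_limit(width_in, height_in, dpi=350, limit=MAX_IMAGE_PIXELS):
    """True if the rendered page would be larger than the pixel limit."""
    return rendered_pixels(width_in, height_in, dpi) > limit


# US Letter at the default 350 DPI: 2975 x 3850 pixels.
print(rendered_pixels(8.5, 11))   # 11453750
print(exceeds_limit(8.5, 11))     # False
print(exceeds_limit(70, 70))      # True: 24500 x 24500 pixels
```

Under these assumptions an ordinary page is far below the limit; only very large sheets (roughly 64 inches square at 350 DPI) approach it.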

Behavior is extensively configurable through environment variables, including model selection (UNSTRUCTURED_HI_RES_MODEL_NAME), rendering quality (PDF_RENDER_DPI, default 350), annotation handling (PDF_ANNOTATION_THRESHOLD, default 0.9), OCR engine selection (OCR_AGENT, default tesseract), and thread limiting (OMP_THREAD_LIMIT, auto-set to 1). Additional variables prefixed with ANALYSIS_* and TESSERACT_* control fine-grained analysis and OCR behavior via config.py.
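As a rough illustration of configuring these variables from Python, the sketch below applies a set of overrides without clobbering values already present in the environment. The helper apply_overrides and the chosen values are hypothetical, and it is an assumption that the overrides need to be in place before the PDF partitioning code reads its configuration.

```python
import os

# Example override values chosen for illustration only.
OVERRIDES = {
    "PDF_RENDER_DPI": "200",
    "OCR_AGENT": "tesseract",
}


def apply_overrides(values, environ=os.environ):
    """Set each variable unless it is already defined; return effective values."""
    for name, value in values.items():
        # setdefault keeps any value the operator exported in the shell.
        environ.setdefault(name, value)
    return {name: environ[name] for name in values}
```

Calling apply_overrides(OVERRIDES) early in a script, before importing unstructured, is the conservative ordering if it is unclear when the library reads each variable.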

Usage

This environment is required whenever the Partition_Pdf implementation is invoked, whether directly via partition_pdf() or indirectly through partition() when auto-detecting a PDF file type.
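A minimal sketch of the direct invocation path follows. The wrapper partition_sample is hypothetical; it assumes the pdf extra and the system packages listed on this page are installed, and it imports partition_pdf lazily so that a missing dependency surfaces when the function is called.

```python
def partition_sample(path, strategy="hi_res"):
    """Partition one PDF and return (element type, text) pairs."""
    # Lazy import: a missing "pdf" extra raises ImportError here, at call time.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(filename=path, strategy=strategy)
    return [(type(element).__name__, element.text) for element in elements]
```

For example, partition_sample("example.pdf") would process a placeholder file named example.pdf; the hi_res strategy is the one that exercises the layout model and OCR dependencies described here.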

System Requirements

Category | Requirement | Notes
Python | >= 3.11, < 3.14 | Required Python version range
OS | Linux (recommended), macOS, Windows | Linux provides best compatibility for all OCR and rendering tools
Tesseract OCR | tesseract-ocr >= 4.0 | Required for OCR-based text extraction
Poppler | poppler-utils | Required by pdf2image for PDF-to-image rendering
libmagic | libmagic1 | Required for MIME type detection
RAM | >= 4 GB recommended | Large PDFs with hi-res model can be memory-intensive

Dependencies

System Packages

  • tesseract-ocr -- OCR engine for extracting text from images and scanned PDFs
  • poppler-utils -- provides pdftoppm and pdfinfo for PDF rendering
  • libmagic1 -- MIME type detection
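Before running a pipeline, it can help to confirm that the command-line tools are reachable. The following is a small preflight sketch with hypothetical helper names; it checks only executables on PATH and the Python version range from the table above, and does not verify libmagic (a shared library) or the Tesseract version.

```python
import shutil
import sys

# Executables provided by tesseract-ocr and poppler-utils.
REQUIRED_TOOLS = ("tesseract", "pdftoppm", "pdfinfo")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_in_range(version=sys.version_info):
    """True if the interpreter is >= 3.11 and < 3.14."""
    return (3, 11) <= tuple(version[:2]) < (3, 14)
```

An empty list from missing_tools() and True from python_in_range() indicate that the basic prerequisites are present.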

Python Packages

  • pdfminer.six >= 20260107 -- PDF text and layout extraction
  • pypdf >= 6.6.2 -- PDF reading and metadata access
  • pdf2image >= 1.17.0 -- converts PDF pages to PIL images via Poppler
  • pi-heif >= 1.2.0 -- HEIF image format support
  • pikepdf >= 10.3.0 -- PDF repair and manipulation
  • unstructured-inference >= 1.2.0 -- layout detection and element classification models
  • unstructured-pytesseract >= 0.3.15 -- Tesseract OCR Python wrapper
  • google-cloud-vision >= 3.12.1 -- Google Cloud Vision API for OCR (optional alternative to Tesseract)
  • numpy >= 1.26.0 -- numerical operations for image and layout processing
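One way to see which of these distributions are present is to query the installed metadata. This sketch uses the standard library's importlib.metadata; the helper name is hypothetical, and it only reports versions rather than comparing them against the minimums listed above.

```python
from importlib import metadata

# Distribution names as listed above.
PDF_DISTRIBUTIONS = (
    "pdfminer.six",
    "pypdf",
    "pdf2image",
    "pi-heif",
    "pikepdf",
    "unstructured-inference",
    "unstructured-pytesseract",
    "google-cloud-vision",
    "numpy",
)


def installed_versions(names=PDF_DISTRIBUTIONS):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Entries with a value of None point at packages that still need to be installed, for instance through the pdf extra.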

Credentials

  • UNSTRUCTURED_HI_RES_MODEL_NAME -- name of the hi-res layout detection model to use
  • PDF_RENDER_DPI -- rendering resolution in DPI (default: 350)
  • PDF_ANNOTATION_THRESHOLD -- threshold for annotation detection (default: 0.9)
  • OCR_AGENT -- OCR engine to use (default: tesseract; alternative: google_cloud_vision)
  • OMP_THREAD_LIMIT -- OpenMP thread limit (auto-set to 1 to prevent thread contention)
  • GOOGLE_APPLICATION_CREDENTIALS -- path to Google Cloud service account JSON (only if using Google Cloud Vision)
  • ANALYSIS_* -- various analysis configuration variables (defined in config.py)
  • TESSERACT_* -- Tesseract-specific configuration variables (defined in config.py)
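The relationship between OCR_AGENT and GOOGLE_APPLICATION_CREDENTIALS can be checked up front. The sketch below is a hypothetical validator, not part of the library; it takes the value google_cloud_vision and the default of 350 from this page and treats any non-numeric PDF_RENDER_DPI as a problem.

```python
import os


def check_ocr_environment(environ=os.environ):
    """Return a list of configuration problems; an empty list means none found."""
    problems = []

    # Credentials matter only when Google Cloud Vision is selected.
    if environ.get("OCR_AGENT", "tesseract") == "google_cloud_vision":
        credentials = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
        if not credentials:
            problems.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
        elif not os.path.isfile(credentials):
            problems.append(f"credentials file not found: {credentials}")

    # The DPI is parsed as an integer, so reject anything else early.
    dpi = environ.get("PDF_RENDER_DPI", "350")
    if not dpi.isdigit() or int(dpi) <= 0:
        problems.append(f"PDF_RENDER_DPI is not a positive integer: {dpi!r}")

    return problems
```

Running the check at startup turns a late failure inside the OCR step into an immediate, readable message.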

Quick Install

# Install system dependencies (Ubuntu/Debian)
sudo apt-get update && sudo apt-get install -y \
    tesseract-ocr poppler-utils libmagic1

# Install unstructured with PDF extras
pip install "unstructured[pdf]"

Code Evidence

PIL pixel limit increase (pdf.py:114):

PIL.Image.MAX_IMAGE_PIXELS = int(5e8)

pdfminer PSParser monkey-patch (pdf.py:105-108):

# Monkey-patch pdfminer PSParser bug that causes infinite loop
# on certain malformed PDF files

ICC ColorSpace monkey-patch (pdf.py:1043-1053):

# Monkey-patch pdfminer ICC ColorSpace issue that raises
# an exception on PDFs with certain color profiles

Environment variable defaults (config.py):

import os

# Each setting falls back to its default when the variable is unset.
PDF_RENDER_DPI = int(os.environ.get("PDF_RENDER_DPI", "350"))
PDF_ANNOTATION_THRESHOLD = float(os.environ.get("PDF_ANNOTATION_THRESHOLD", "0.9"))
OCR_AGENT = os.environ.get("OCR_AGENT", "tesseract")

Common Errors

Error Message | Cause | Solution
TesseractNotFoundError: tesseract is not installed | Tesseract OCR not available on system PATH | Install via sudo apt-get install tesseract-ocr
PDFInfoNotInstalledError: Unable to get page count. Is poppler installed? | poppler-utils not installed | Install via sudo apt-get install poppler-utils
DecompressionBombError: Image size exceeds limit | PIL pixel limit too low for large PDF pages | The library sets MAX_IMAGE_PIXELS = 5e8 automatically; if still triggered, the PDF page is exceptionally large
ImportError: pdfminer is required | The pdf extra is not installed | Install via pip install "unstructured[pdf]"
Out of memory during hi-res processing | Large PDF with high DPI rendering and layout model | Reduce PDF_RENDER_DPI, use a smaller model, or increase available RAM
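For the out-of-memory case, a back-of-the-envelope estimate shows why lowering PDF_RENDER_DPI helps: the size of a rendered page grows with the square of the DPI. The helpers below are hypothetical and assume an uncompressed 8-bit RGB image (3 bytes per pixel); they account for a single page image only, not for the layout model or OCR buffers.

```python
import math


def page_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Approximate size in bytes of one uncompressed rendered page."""
    return round(width_in * dpi) * round(height_in * dpi) * bytes_per_pixel


def max_dpi_for_budget(width_in, height_in, budget_bytes, bytes_per_pixel=3):
    """Largest whole DPI whose page image is estimated to fit in budget_bytes."""
    bytes_per_dpi_squared = width_in * height_in * bytes_per_pixel
    return int(math.sqrt(budget_bytes / bytes_per_dpi_squared))


# US Letter page: about 34 MB at 350 DPI versus about 11 MB at 200 DPI.
print(page_bytes(8.5, 11, 350))   # 34361250
print(page_bytes(8.5, 11, 200))   # 11220000
```

Under these assumptions, dropping from 350 to 200 DPI cuts the per-page image to roughly a third of its size.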

Compatibility Notes

  • OMP_THREAD_LIMIT is automatically set to 1 to prevent thread contention when running layout detection models alongside OCR
  • The pdfminer PSParser monkey-patch addresses an upstream bug that can cause infinite loops on malformed PDFs
  • The ICC ColorSpace monkey-patch prevents exceptions on PDFs with non-standard color profiles
  • Google Cloud Vision can be used as an alternative OCR engine by setting OCR_AGENT=google_cloud_vision and providing valid credentials
  • Hi-res model selection via UNSTRUCTURED_HI_RES_MODEL_NAME affects both accuracy and processing speed
