Environment:Unstructured IO Unstructured PDF Dependencies
| Knowledge Sources | |
|---|---|
| Domains | PDF Processing |
| Last Updated | 2026-02-12 09:00 GMT |
Overview
The PDF_Dependencies environment provides all system-level and Python-level dependencies required for partitioning PDF documents, including OCR engines, layout detection models, and rendering utilities.
Description
PDF processing in unstructured is one of the most dependency-intensive operations, requiring a combination of Python packages, system libraries, and optionally machine learning models. The pdf and image extras in pyproject.toml install the necessary Python packages, but several system-level dependencies (Tesseract, Poppler, libmagic) must also be present.
The codebase includes several notable runtime adjustments: the PIL pixel limit is raised to 5e8 (pdf.py:114) so that high-resolution PDF page renders can be opened, and monkey-patches are applied to fix a pdfminer PSParser bug (pdf.py:105-108) and an ICC ColorSpace issue (pdf.py:1043-1053). These patches make partitioning more robust on real-world PDF files that trigger edge cases in upstream libraries.
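To get a feel for how much headroom the 5e8 limit leaves, the pixel count of a rendered page can be estimated from its physical size and the render DPI. The following is illustrative arithmetic only: the helper name rendered_pixels and the two page sizes are assumptions made for this example and do not come from the codebase.

```python
# Illustrative arithmetic only; `rendered_pixels` is a hypothetical helper.
PIXEL_LIMIT = int(5e8)  # value this page reports for PIL.Image.MAX_IMAGE_PIXELS


def rendered_pixels(width_in: float, height_in: float, dpi: int) -> int:
    """Approximate pixel count of a page rendered at `dpi` dots per inch."""
    return round(width_in * dpi) * round(height_in * dpi)


letter = rendered_pixels(8.5, 11, 350)  # US Letter at the documented default DPI
poster = rendered_pixels(60, 80, 350)   # hypothetical large-format page

print("letter:", letter, "within limit:", letter <= PIXEL_LIMIT)
print("poster:", poster, "within limit:", poster <= PIXEL_LIMIT)
```

Under these assumptions an ordinary page sits far below the limit, while a very large-format page at the same DPI can exceed it, which is the situation where lowering PDF_RENDER_DPI helps.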
Behavior is extensively configurable through environment variables, including model selection (UNSTRUCTURED_HI_RES_MODEL_NAME), rendering quality (PDF_RENDER_DPI, default 350), annotation handling (PDF_ANNOTATION_THRESHOLD, default 0.9), OCR engine selection (OCR_AGENT, default tesseract), and thread limiting (OMP_THREAD_LIMIT, auto-set to 1). Additional variables prefixed with ANALYSIS_* and TESSERACT_* control fine-grained analysis and OCR behavior via config.py.
Usage
This environment is required whenever the Partition_Pdf implementation is invoked, whether directly via partition_pdf() or indirectly through partition() when auto-detecting a PDF file type.
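A minimal sketch of a direct call is shown here. The wrapper name load_pdf_elements and the file name in the usage note are hypothetical; partition_pdf with its filename and strategy parameters is the library's entry point, and running the sketch end to end assumes the dependencies on this page are installed.

```python
from pathlib import Path


def load_pdf_elements(path: str, strategy: str = "auto"):
    """Partition a PDF and return its elements (hypothetical wrapper)."""
    pdf_path = Path(path)
    if not pdf_path.is_file():
        # Fail early with a clear message before loading heavy dependencies.
        raise FileNotFoundError(f"No such PDF: {pdf_path}")

    # Imported lazily so the dependency-heavy module loads only when needed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=str(pdf_path), strategy=strategy)
```

A call such as load_pdf_elements("report.pdf", strategy="hi_res") would exercise the layout model and OCR path, so it is the call most likely to surface a missing system dependency.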
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Python | >= 3.11, < 3.14 | Required Python version range |
| OS | Linux (recommended), macOS, Windows | Linux provides best compatibility for all OCR and rendering tools |
| Tesseract OCR | tesseract-ocr >= 4.0 | Required for OCR-based text extraction |
| Poppler | poppler-utils | Required by pdf2image for PDF-to-image rendering |
| libmagic | libmagic1 | Required for MIME type detection |
| RAM | >= 4 GB recommended | Large PDFs with hi-res model can be memory-intensive |
Dependencies
System Packages
- tesseract-ocr -- OCR engine for extracting text from images and scanned PDFs
- poppler-utils -- provides pdftoppm and pdfinfo for PDF rendering
- libmagic1 -- MIME type detection
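Whether the executables from these packages are reachable can be checked before partitioning. The sketch below uses only the standard library; the function name missing_system_packages and the tool-to-package mapping are assumptions for illustration. libmagic1 is a shared library rather than an executable, so it is not covered by this check.

```python
import shutil
from typing import Callable, List, Optional

# Executables associated with each system package listed above.
# libmagic1 ships a shared library, not an executable, so it is omitted.
REQUIRED_TOOLS = {
    "tesseract": "tesseract-ocr",
    "pdftoppm": "poppler-utils",
    "pdfinfo": "poppler-utils",
}


def missing_system_packages(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the sorted package names whose executables are not on PATH."""
    missing = {pkg for tool, pkg in REQUIRED_TOOLS.items() if which(tool) is None}
    return sorted(missing)


if __name__ == "__main__":
    print(missing_system_packages() or "all PDF system tools found")
```

The which parameter exists only so the lookup can be replaced in tests; in normal use the default shutil.which inspects the current PATH.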
Python Packages
- pdfminer.six >= 20260107 -- PDF text and layout extraction
- pypdf >= 6.6.2 -- PDF reading and metadata access
- pdf2image >= 1.17.0 -- converts PDF pages to PIL images via Poppler
- pi-heif >= 1.2.0 -- HEIF image format support
- pikepdf >= 10.3.0 -- PDF repair and manipulation
- unstructured-inference >= 1.2.0 -- layout detection and element classification models
- unstructured-pytesseract >= 0.3.15 -- Tesseract OCR Python wrapper
- google-cloud-vision >= 3.12.1 -- Google Cloud Vision API for OCR (optional alternative to Tesseract)
- numpy >= 1.26.0 -- numerical operations for image and layout processing
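Installed versions can be compared against these minimums with importlib.metadata. The sketch below checks only a subset of the list, and its version parsing is deliberately simplified (numeric dot-separated components only); the names MINIMUMS, version_key, and unmet_minimums are hypothetical. For pre-release or otherwise non-numeric versions, the third-party packaging library would be a more robust choice.

```python
from importlib import metadata
from typing import Callable, Dict, Tuple

# Minimums copied from the list above (a subset, for brevity).
MINIMUMS = {
    "pdfminer.six": "20260107",
    "pypdf": "6.6.2",
    "pdf2image": "1.17.0",
    "pikepdf": "10.3.0",
}


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '6.6.2' into (6, 6, 2); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def unmet_minimums(
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, str]:
    """Map each package that is missing or below its minimum to a short reason."""
    problems = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if version_key(installed) < version_key(minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems
```

An empty result from unmet_minimums() suggests the checked packages satisfy the listed minimums in the current interpreter's environment.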
Credentials
- UNSTRUCTURED_HI_RES_MODEL_NAME -- name of the hi-res layout detection model to use
- PDF_RENDER_DPI -- rendering resolution in DPI (default: 350)
- PDF_ANNOTATION_THRESHOLD -- threshold for annotation detection (default: 0.9)
- OCR_AGENT -- OCR engine to use (default: tesseract; alternative: google_cloud_vision)
- OMP_THREAD_LIMIT -- OpenMP thread limit (auto-set to 1 to prevent thread contention)
- GOOGLE_APPLICATION_CREDENTIALS -- path to Google Cloud service account JSON (only if using Google Cloud Vision)
- ANALYSIS_* -- various analysis configuration variables (defined in config.py)
- TESSERACT_* -- Tesseract-specific configuration variables (defined in config.py)
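These variables can also be set from Python before partitioning starts. The sketch below is one possible arrangement, not the library's own mechanism: configure_pdf_env and its parameters are hypothetical, and it assumes the values are read from the process environment no earlier than the point where this function is called.

```python
import os
from typing import Dict, Optional


def configure_pdf_env(
    render_dpi: int = 350,
    ocr_agent: str = "tesseract",
    gcv_credentials_path: Optional[str] = None,
) -> Dict[str, str]:
    """Set PDF-related variables in os.environ and return what was set."""
    settings = {
        "PDF_RENDER_DPI": str(render_dpi),
        "OCR_AGENT": ocr_agent,
    }
    if ocr_agent == "google_cloud_vision":
        # The credentials variable only matters for the Google Cloud Vision agent.
        if gcv_credentials_path is None:
            raise ValueError("Google Cloud Vision needs a credentials path")
        settings["GOOGLE_APPLICATION_CREDENTIALS"] = gcv_credentials_path
    os.environ.update(settings)
    return settings
```

Calling configure_pdf_env(render_dpi=200) before invoking partition_pdf() would, under the assumption above, lower the render resolution and reduce memory use on large documents.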
Quick Install
# Install system dependencies (Ubuntu/Debian)
sudo apt-get update && sudo apt-get install -y \
tesseract-ocr poppler-utils libmagic1
# Install unstructured with PDF extras
pip install "unstructured[pdf]"
Code Evidence
PIL pixel limit increase (pdf.py:114):
PIL.Image.MAX_IMAGE_PIXELS = int(5e8)
pdfminer PSParser monkey-patch (pdf.py:105-108):
# Monkey-patch pdfminer PSParser bug that causes infinite loop
# on certain malformed PDF files
ICC ColorSpace monkey-patch (pdf.py:1043-1053):
# Monkey-patch pdfminer ICC ColorSpace issue that raises
# an exception on PDFs with certain color profiles
Environment variable defaults (config.py):
PDF_RENDER_DPI = int(os.environ.get("PDF_RENDER_DPI", "350"))
PDF_ANNOTATION_THRESHOLD = float(os.environ.get("PDF_ANNOTATION_THRESHOLD", "0.9"))
OCR_AGENT = os.environ.get("OCR_AGENT", "tesseract")
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| TesseractNotFoundError: tesseract is not installed | Tesseract OCR not available on system PATH | Install via sudo apt-get install tesseract-ocr |
| PDFInfoNotInstalledError: Unable to get page count. Is poppler installed? | poppler-utils not installed | Install via sudo apt-get install poppler-utils |
| DecompressionBombError: Image size exceeds limit | PIL pixel limit too low for large PDF pages | The library sets MAX_IMAGE_PIXELS = 5e8 automatically; if still triggered, the PDF page is exceptionally large |
| ImportError: pdfminer is required | The pdf extra is not installed | Install via pip install "unstructured[pdf]" |
| Out of memory during hi-res processing | Large PDF with high DPI rendering and layout model | Reduce PDF_RENDER_DPI, use a smaller model, or increase available RAM |
Compatibility Notes
- OMP_THREAD_LIMIT is automatically set to 1 to prevent thread contention when running layout detection models alongside OCR
- The pdfminer PSParser monkey-patch addresses an upstream bug that can cause infinite loops on malformed PDFs
- The ICC ColorSpace monkey-patch prevents exceptions on PDFs with non-standard color profiles
- Google Cloud Vision can be used as an alternative OCR engine by setting OCR_AGENT=google_cloud_vision and providing valid credentials
- Hi-res model selection via UNSTRUCTURED_HI_RES_MODEL_NAME affects both accuracy and processing speed