Heuristic:Unstructured IO Unstructured Hi Res Model Configuration
| Knowledge Sources | |
|---|---|
| Domains | PDF Processing, Model Configuration, OCR, Strategy Selection |
| Last Updated | 2026-02-12 09:00 GMT |
Overview
The hi-res partitioning model is configured through a chain of lazy-loaded defaults, environment variable overrides, and auto-tuned threading settings that must be understood to avoid import-time failures and performance pitfalls.
Description
The hi-res strategy relies on a machine learning layout detection model whose configuration is resolved through several layers:
Lazy model name resolution (pdf.py:118-125): The model name is resolved lazily at first use, not at import time. This is a deliberate design choice so that users can set the UNSTRUCTURED_HI_RES_MODEL_NAME environment variable after importing the module but before calling any partitioning functions. The default_hi_res_model() function defers the import of unstructured_inference.models.base.DEFAULT_MODEL until it is actually needed.
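The difference between eager and lazy resolution can be sketched without the library itself. This is a minimal illustration, not the library's code: FALLBACK_MODEL is a hypothetical placeholder (the real default comes from unstructured_inference), and lazy_model_name is an illustrative stand-in for default_hi_res_model().

```python
import os

ENV_VAR = "UNSTRUCTURED_HI_RES_MODEL_NAME"
FALLBACK_MODEL = "fallback-model"  # hypothetical placeholder, not the real default

# Start from a clean slate so the demonstration is deterministic.
os.environ.pop(ENV_VAR, None)

# Eager: the value is captured once, when this line runs (i.e. at import time).
EAGER_MODEL_NAME = os.environ.get(ENV_VAR, FALLBACK_MODEL)


def lazy_model_name() -> str:
    """Read the environment on every call, so late overrides are honoured."""
    return os.environ.get(ENV_VAR, FALLBACK_MODEL)


# Simulate an application that configures the environment after import.
os.environ[ENV_VAR] = "my-custom-model"

print(EAGER_MODEL_NAME)   # fallback-model (override arrived too late)
print(lazy_model_name())  # my-custom-model
```

The eager constant misses the override because it was evaluated before the variable was set; the lazy function picks it up at call time.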
Environment variable overrides:
- UNSTRUCTURED_HI_RES_MODEL_NAME: Overrides the default layout detection model. Takes precedence over the default from unstructured_inference.
- UNSTRUCTURED_HI_RES_SUPPORTED_MODEL (model_init.py): Used for validating that a user-specified model name is actually supported.
- PDF_RENDER_DPI (config.py:244-246): Defaults to 350 for high-quality rendering. Higher DPI improves OCR accuracy but increases memory usage and processing time.
- OCR_AGENT (config.py:110-112): Defaults to tesseract but can be changed to alternative OCR engines (e.g., paddle).
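When setting up a new environment, it can help to print which values will actually take effect. The helper below is a hypothetical diagnostic sketch, not part of the library; it only encodes the defaults stated in the list above and assumes no default for the model name, since that comes from unstructured_inference.

```python
import os

# Defaults as stated in this document; the helper itself is hypothetical.
DOCUMENTED_DEFAULTS = {
    "PDF_RENDER_DPI": "350",
    "OCR_AGENT": "tesseract",
}


def effective_settings(environ=None) -> dict:
    """Return the value each variable would resolve to in the given environment."""
    env = os.environ if environ is None else environ
    settings = {
        name: env.get(name, default)
        for name, default in DOCUMENTED_DEFAULTS.items()
    }
    # None here means "not overridden; the library default applies".
    settings["UNSTRUCTURED_HI_RES_MODEL_NAME"] = env.get(
        "UNSTRUCTURED_HI_RES_MODEL_NAME"
    )
    return settings


if __name__ == "__main__":
    for name, value in effective_settings().items():
        print(f"{name} = {value}")
```

Passing a plain dict instead of os.environ makes the helper easy to exercise in tests without mutating the process environment.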
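Threading auto-tuning (tesseract_ocr.py:32-33): OMP_THREAD_LIMIT is automatically set to 1 when Tesseract is used, unless the variable is already defined. The code uses os.environ.setdefault, so an explicitly configured value is left untouched. The default of 1 prevents the OpenMP thread contention that occurs when Tesseract's internal threading conflicts with Python's multiprocessing or other parallel workloads.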
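The code evidence for this setting uses os.environ.setdefault, which only writes the value when the variable is missing. The sketch below shows that behavior on plain dictionaries standing in for os.environ; the function name apply_thread_default is illustrative only.

```python
def apply_thread_default(environ) -> str:
    """Apply the default the same way the excerpt does and return the result."""
    environ.setdefault("OMP_THREAD_LIMIT", "1")
    return environ["OMP_THREAD_LIMIT"]


unset_env = {}                          # variable not configured
preset_env = {"OMP_THREAD_LIMIT": "4"}  # operator chose a value explicitly

print(apply_thread_default(unset_env))   # 1
print(apply_thread_default(preset_env))  # 4
```

In practice this means a deployment that wants more Tesseract threads per document can export OMP_THREAD_LIMIT before the process starts, and the automatic default will not override it.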
Warning suppression (pdf.py:342-344): Python warnings are suppressed during hi-res model detection runs to prevent noisy deprecation warnings from underlying ML libraries (PyTorch, transformers) from cluttering output.
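The effect of this pattern can be checked in isolation with the standard warnings module. In the sketch below, noisy_detect is a hypothetical stand-in for a model call that emits a deprecation warning; record=True is used only so the example can count what would have been shown.

```python
import warnings


def noisy_detect() -> str:
    """Stand-in for a model call that triggers a library deprecation warning."""
    warnings.warn("deprecated call inside an ML library", DeprecationWarning)
    return "layout"


def run_detection(filter_action: str):
    """Run the call under the given warning filter; return result and warning count."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(filter_action)
        result = noisy_detect()
    return result, len(caught)


print(run_detection("ignore"))  # ('layout', 0)
print(run_detection("always"))  # ('layout', 1)
```

The filter change is scoped to the with block, so warnings raised elsewhere in the application are unaffected. The Python documentation notes that catch_warnings modifies process-wide state and is not thread-safe by default, which matters if detection runs in multiple threads.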
Page count limiting (pdf.py:577-588): The pdf_hi_res_max_pages parameter sets an upper bound on the number of pages processed in hi-res mode. Documents exceeding this limit will only have the first N pages processed, preventing runaway memory and time consumption on large documents.
Additional configuration:
- HEIF format registration (pdf.py:290): HEIF image format support is registered at partition start to handle HEIF-encoded page images.
- "auto" language mode (lang.py:310): The "auto" language detection mode is available only for partitioners other than the PDF and image partitioners. Those two require an explicit language because their OCR backends need the language parameter upfront.
Usage
Apply this heuristic when:
- Setting up hi-res partitioning in a new environment and needing to understand which environment variables control behavior.
- Debugging import-time errors related to model loading (ensure lazy loading is not bypassed).
- Optimizing throughput in multi-process PDF processing pipelines (check OMP_THREAD_LIMIT).
- Processing large documents and needing to control memory usage (tune PDF_RENDER_DPI and pdf_hi_res_max_pages).
- Switching between OCR engines or layout detection models.
The Insight (Rule of Thumb)
- Action: Set UNSTRUCTURED_HI_RES_MODEL_NAME before the first partition call (it does not need to be set before import). For Tesseract in parallel workloads, OMP_THREAD_LIMIT=1 is applied automatically unless the variable is already set. Use pdf_hi_res_max_pages to limit processing of large documents. Set PDF_RENDER_DPI based on quality-vs-speed needs.
- Value: Default DPI is 350. Default OCR agent is tesseract. OMP_THREAD_LIMIT defaults to 1 when unset. The model name is resolved lazily. "auto" language mode is not supported by the PDF and image partitioners.
- Trade-off: Lazy model loading means errors surface at first use rather than at import time, which can be surprising in production. OMP_THREAD_LIMIT=1 reduces Tesseract's per-document parallelism but prevents contention in multi-document pipelines. Higher DPI improves quality, but the pixel count per page grows with the square of the DPI, and memory use and (roughly) processing time grow with it.
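The quadratic cost of DPI can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a US Letter page (8.5 x 11 inches) rendered as uncompressed 8-bit RGB; real memory use depends on the rendering backend and any intermediate copies, so treat the numbers as a lower-bound estimate.

```python
def rendered_page_bytes(
    dpi: int,
    width_in: float = 8.5,   # assumption: US Letter width in inches
    height_in: float = 11.0,  # assumption: US Letter height in inches
    channels: int = 3,        # assumption: 8-bit RGB, one byte per channel
) -> int:
    """Approximate uncompressed size in bytes of one rendered page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels


for dpi in (200, 350):
    mib = rendered_page_bytes(dpi) / 2**20
    print(f"{dpi} DPI: {mib:.1f} MiB per page")
# 200 DPI: 10.7 MiB per page
# 350 DPI: 32.8 MiB per page
```

Going from 200 to 350 DPI multiplies the DPI by 1.75 but the per-page pixel buffer by 1.75 squared, about 3.06.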
Reasoning
The lazy-loading pattern for model configuration addresses a real deployment pain point: many applications import the unstructured module at startup but only configure environment variables later (e.g., from a config file or service mesh). Eager loading would force users to set environment variables before any import, which conflicts with standard Python application patterns. The OMP_THREAD_LIMIT auto-tuning reflects empirical findings that Tesseract's default multi-threaded behavior causes severe performance degradation when multiple documents are processed concurrently, because OpenMP threads from different Tesseract instances contend for CPU cores. The page count limit is a safety valve for production deployments where unbounded processing of a single large document could starve other requests.
Code Evidence
Lazy model name resolution (pdf.py:118-125):
# pdf.py:118-125
import os

def default_hi_res_model() -> str:
    """Resolve model name lazily so users can set env var after import."""
    model_name = os.environ.get("UNSTRUCTURED_HI_RES_MODEL_NAME")
    if model_name is not None:
        return model_name
    # Deferred import -- only happens when actually needed
    from unstructured_inference.models.base import DEFAULT_MODEL
    return DEFAULT_MODEL
OMP_THREAD_LIMIT auto-set for Tesseract (tesseract_ocr.py:32-33):
# tesseract_ocr.py:32-33
# Prevent OpenMP thread contention when Tesseract runs alongside
# other parallel workloads
os.environ.setdefault("OMP_THREAD_LIMIT", "1")
PDF_RENDER_DPI default (config.py:244-246):
# config.py:244-246
PDF_RENDER_DPI = int(os.environ.get("PDF_RENDER_DPI", "350"))
# 350 DPI balances OCR accuracy with memory usage for standard documents
Warning suppression during detection (pdf.py:342-344):
# pdf.py:342-344
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    detected_layout = model.detect(image)
Page count limit (pdf.py:577-588):
# pdf.py:577-588
if pdf_hi_res_max_pages and len(pages) > pdf_hi_res_max_pages:
    logger.warning(
        f"Document has {len(pages)} pages but pdf_hi_res_max_pages={pdf_hi_res_max_pages}. "
        f"Only the first {pdf_hi_res_max_pages} pages will be processed."
    )
    pages = pages[:pdf_hi_res_max_pages]
"auto" language mode restriction (lang.py:310):
# lang.py:310
# "auto" language detection only for non-PDF/image partitioners
# PDF/image partitioners need explicit language for OCR backends
if language == "auto" and partitioner_type in ("pdf", "image"):
    raise ValueError("'auto' language mode is not supported for PDF/image partitioners")