Heuristic:Unstructured IO Unstructured Hi Res Model Configuration

From Leeroopedia
Knowledge Sources
Domains PDF Processing, Model Configuration, OCR, Strategy Selection
Last Updated 2026-02-12 09:00 GMT

Overview

The hi-res partitioning model is configured through a chain of lazily resolved defaults, environment variable overrides, and an automatic threading default. Understanding this chain helps avoid import-time failures and performance pitfalls.

Description

The hi-res strategy relies on a machine learning layout detection model whose configuration is resolved through several layers:

Lazy model name resolution (pdf.py:118-125): The model name is resolved lazily at first use, not at import time. This is a deliberate design choice so that users can set the UNSTRUCTURED_HI_RES_MODEL_NAME environment variable after importing the module but before calling any partitioning functions. The default_hi_res_model() function defers the import of unstructured_inference.models.base.DEFAULT_MODEL until it is actually needed.
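The ordering this permits can be sketched without the real library. In the sketch below, resolve_model_name is a hypothetical stand-in for default_hi_res_model(), and both "fallback-model" and "my-custom-model" are invented placeholder names, not values shipped by unstructured_inference.

```python
import os

ENV_VAR = "UNSTRUCTURED_HI_RES_MODEL_NAME"
FALLBACK_MODEL = "fallback-model"  # hypothetical placeholder for DEFAULT_MODEL


def resolve_model_name() -> str:
    """Stand-in resolver: the environment is read at call time, not at definition time."""
    name = os.environ.get(ENV_VAR)
    return name if name is not None else FALLBACK_MODEL


# Simulate "import first, configure later": the variable is unset at this point.
os.environ.pop(ENV_VAR, None)
before = resolve_model_name()  # no override yet, so the fallback is returned

# Configuration arrives after the "import" but before first use.
os.environ[ENV_VAR] = "my-custom-model"  # hypothetical model name
after = resolve_model_name()  # the override is picked up on the next call
```

Because the lookup happens inside the function body, nothing is cached at import time in this sketch; whether the real library caches the resolved name after first use is not stated in the source.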

Environment variable overrides:

  • UNSTRUCTURED_HI_RES_MODEL_NAME: Overrides the default layout detection model. Takes precedence over the default from unstructured_inference.
  • UNSTRUCTURED_HI_RES_SUPPORTED_MODEL (model_init.py): Used for validating that a user-specified model name is actually supported.
  • PDF_RENDER_DPI (config.py:244-246): Defaults to 350 for high-quality rendering. Higher DPI improves OCR accuracy but increases memory usage and processing time.
  • OCR_AGENT (config.py:110-112): Defaults to tesseract but can be changed to alternative OCR engines (e.g., paddle).
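As an illustration of how such overrides are typically read, the following sketch resolves PDF_RENDER_DPI and OCR_AGENT with the defaults stated above. HiResSettings and load_settings are hypothetical names introduced here, and the validation of the DPI value is an assumption rather than documented library behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HiResSettings:
    """Hypothetical container for the two overrides discussed above."""
    render_dpi: int
    ocr_agent: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> HiResSettings:
    """Read overrides from a mapping (os.environ by default), falling back to the stated defaults."""
    env = os.environ if env is None else env
    raw_dpi = env.get("PDF_RENDER_DPI", "350")
    try:
        dpi = int(raw_dpi)
    except ValueError as exc:
        # Assumed behavior: fail loudly on a non-numeric DPI instead of guessing.
        raise ValueError(f"PDF_RENDER_DPI must be an integer, got {raw_dpi!r}") from exc
    if dpi <= 0:
        raise ValueError(f"PDF_RENDER_DPI must be positive, got {dpi}")
    return HiResSettings(render_dpi=dpi, ocr_agent=env.get("OCR_AGENT", "tesseract"))


defaults = load_settings({})
overridden = load_settings({"PDF_RENDER_DPI": "200", "OCR_AGENT": "paddle"})
```

Passing the mapping explicitly keeps the sketch testable without mutating the process environment.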

Threading auto-tuning (tesseract_ocr.py:32-33): OMP_THREAD_LIMIT is automatically set to 1 when Tesseract is used, unless the variable is already set in the environment. This prevents the OpenMP thread contention that occurs when Tesseract's internal threading competes with Python's multiprocessing or other parallel workloads.
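The precedence of an operator-supplied value follows from how os.environ.setdefault behaves. The sketch below exercises both cases; the value "4" is an arbitrary example, and the save-and-restore steps exist only to keep the demonstration from leaking into the surrounding process.

```python
import os

VAR = "OMP_THREAD_LIMIT"
original = os.environ.get(VAR)  # remember the caller's value so it can be restored

# Case 1: the variable is unset, so the default of "1" is applied.
os.environ.pop(VAR, None)
os.environ.setdefault(VAR, "1")
unset_case = os.environ[VAR]

# Case 2: the variable was already set, so setdefault leaves it alone.
os.environ[VAR] = "4"
os.environ.setdefault(VAR, "1")
preset_case = os.environ[VAR]

# Restore the environment to its state before the demonstration.
if original is None:
    os.environ.pop(VAR, None)
else:
    os.environ[VAR] = original
```

In a pipeline that deliberately wants more Tesseract threads per process, exporting OMP_THREAD_LIMIT before the OCR module is imported would therefore take effect.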

Warning suppression (pdf.py:342-344): Python warnings are suppressed during hi-res model detection runs so that noisy deprecation warnings from underlying ML libraries (PyTorch, transformers) do not clutter the output. The filter used ignores every warning category for the duration of the call, not only deprecation warnings.
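The scope of that suppression can be checked with the standard library alone. In this sketch, noisy_detect is a hypothetical stand-in for a model call that emits a deprecation warning; the outer recording context exists only to count which warnings get through.

```python
import warnings


def noisy_detect() -> str:
    """Hypothetical stand-in for a detection call that emits a deprecation warning."""
    warnings.warn("legacy API", DeprecationWarning)
    return "layout"


with warnings.catch_warnings(record=True) as captured:
    warnings.simplefilter("always")  # surface every warning to the recorder

    # Inside the inner context, all warnings are ignored.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        result = noisy_detect()
    suppressed_count = len(captured)

    # On exit the previous filters are restored, so the same call warns again.
    noisy_detect()
    visible_count = len(captured)
```

The suppression is limited to the with block, so warnings raised elsewhere in the application remain visible.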

Page count limiting (pdf.py:577-588): The pdf_hi_res_max_pages parameter sets an upper bound on the number of pages processed in hi-res mode. Documents exceeding this limit will only have the first N pages processed, preventing runaway memory and time consumption on large documents.

Additional configuration:

  • HEIF format registration (pdf.py:290): HEIF image format support is registered at partition start to handle HEIF-encoded page images.
  • "auto" language mode (lang.py:310): The "auto" language detection mode is only available for non-PDF/image partitioners. PDF and image partitioners require explicit language specification because their OCR backends need the language parameter upfront.

Usage

Apply this heuristic when:

  • Setting up hi-res partitioning in a new environment and needing to understand which environment variables control behavior.
  • Debugging import-time errors related to model loading (ensure lazy loading is not bypassed).
  • Optimizing throughput in multi-process PDF processing pipelines (check OMP_THREAD_LIMIT).
  • Processing large documents and needing to control memory usage (tune PDF_RENDER_DPI and pdf_hi_res_max_pages).
  • Switching between OCR engines or layout detection models.

The Insight (Rule of Thumb)

  • Action: Set UNSTRUCTURED_HI_RES_MODEL_NAME before the first partition call; it does not need to be set before import. OMP_THREAD_LIMIT=1 is applied automatically for Tesseract unless the variable is already set. Use pdf_hi_res_max_pages to cap processing of large documents. Choose PDF_RENDER_DPI based on quality-versus-speed needs.
  • Value: Default DPI is 350. Default OCR agent is tesseract. OMP_THREAD_LIMIT defaults to 1. The model name is resolved lazily. The "auto" language mode is not available for PDF or image partitioners.
  • Trade-off: Lazy model loading means errors surface at first use rather than at import time, which can be surprising in production. OMP_THREAD_LIMIT=1 reduces Tesseract's per-document parallelism but prevents contention in multi-document pipelines. Higher DPI improves quality, but the pixel count per page, and with it memory use, grows with the square of the DPI, and processing time rises as well.
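The DPI trade-off can be made concrete with a little arithmetic. The sketch below estimates the size of one uncompressed rendered page, assuming a US Letter page (8.5 x 11 inches) and 8-bit RGB pixels; rendered_page_bytes is a hypothetical helper, and the library's real memory footprint will differ.

```python
def rendered_page_bytes(
    dpi: int,
    width_in: float = 8.5,
    height_in: float = 11.0,
    channels: int = 3,
) -> int:
    """Estimate the uncompressed size of one rendered page (assumes 1 byte per channel)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels


at_200 = rendered_page_bytes(200)  # 1700 x 2200 px x 3 bytes = 11,220,000 bytes
at_350 = rendered_page_bytes(350)  # 2975 x 3850 px x 3 bytes = 34,361,250 bytes
ratio = at_350 / at_200            # (350 / 200) ** 2 = 3.0625
```

Under these assumptions, raising the DPI from 200 to the default of 350 roughly triples the per-page bitmap, from about 11 MB to about 34 MB.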

Reasoning

The lazy-loading pattern for model configuration addresses a real deployment pain point: many applications import the unstructured module at startup but only configure environment variables later (e.g., from a config file or service mesh). Eager loading would force users to set environment variables before any import, which conflicts with standard Python application patterns. The OMP_THREAD_LIMIT auto-tuning reflects empirical findings that Tesseract's default multi-threaded behavior causes severe performance degradation when multiple documents are processed concurrently, because OpenMP threads from different Tesseract instances contend for CPU cores. The page count limit is a safety valve for production deployments where unbounded processing of a single large document could starve other requests.

Code Evidence

Lazy model name resolution (pdf.py:118-125):

# pdf.py:118-125
def default_hi_res_model() -> str:
    """Resolve model name lazily so users can set env var after import."""
    model_name = os.environ.get("UNSTRUCTURED_HI_RES_MODEL_NAME")
    if model_name is not None:
        return model_name
    # Deferred import -- only happens when actually needed
    from unstructured_inference.models.base import DEFAULT_MODEL
    return DEFAULT_MODEL

OMP_THREAD_LIMIT auto-set for Tesseract (tesseract_ocr.py:32-33):

# tesseract_ocr.py:32-33
# Prevent OpenMP thread contention when Tesseract runs alongside
# other parallel workloads
os.environ.setdefault("OMP_THREAD_LIMIT", "1")

PDF_RENDER_DPI default (config.py:244-246):

# config.py:244-246
PDF_RENDER_DPI = int(os.environ.get("PDF_RENDER_DPI", "350"))
# 350 DPI balances OCR accuracy with memory usage for standard documents

Warning suppression during detection (pdf.py:342-344):

# pdf.py:342-344
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    detected_layout = model.detect(image)

Page count limit (pdf.py:577-588):

# pdf.py:577-588
if pdf_hi_res_max_pages and len(pages) > pdf_hi_res_max_pages:
    logger.warning(
        f"Document has {len(pages)} pages but pdf_hi_res_max_pages={pdf_hi_res_max_pages}. "
        f"Only the first {pdf_hi_res_max_pages} pages will be processed."
    )
    pages = pages[:pdf_hi_res_max_pages]

"auto" language mode restriction (lang.py:310):

# lang.py:310
# "auto" language detection only for non-PDF/image partitioners
# PDF/image partitioners need explicit language for OCR backends
if language == "auto" and partitioner_type in ("pdf", "image"):
    raise ValueError("'auto' language mode is not supported for PDF/image partitioners")
