Heuristic:Unstructured IO Unstructured PDF Element Sorting
| Knowledge Sources | |
|---|---|
| Domains | PDF Processing, Element Ordering, Layout Analysis |
| Last Updated | 2026-02-12 09:00 GMT |
Overview
PDF elements are sorted in two phases: a basic sort first, for deterministic ordering across Python versions, then the requested sort mode on top. The pipeline also special-cases Tesseract output, list items, and tables, and monkey-patches two upstream PDFMiner bugs.
Description
Extracting elements from PDFs produces an unordered set of bounding boxes and text fragments. The sorting subsystem imposes reading order through a carefully layered approach:
Two-phase sorting (pdf.py:896-897): A basic positional sort is always applied first to guarantee deterministic ordering across different Python versions (which may have different dict/set iteration orders). The user-requested sort mode (e.g., XY-cut) is then applied on top of this stable base.
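Why the stable base matters can be sketched with a toy example. Python's `sorted` is stable, so elements that tie under the second key keep the order the first sort gave them. The `Box` type, `sort_basic`, and `sort_by_column` below are hypothetical stand-ins for illustration, not the library's actual functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    text: str
    x: float
    y: float


def sort_basic(boxes):
    """Phase 1: a total order (top-to-bottom, left-to-right, then text)."""
    return sorted(boxes, key=lambda b: (b.y, b.x, b.text))


def sort_by_column(boxes, column_width=300.0):
    """Phase 2 (hypothetical mode): a coarse key, so many boxes tie."""
    return sorted(boxes, key=lambda b: int(b.x // column_width))


def two_phase(boxes):
    # Ties in phase 2 are resolved by the order phase 1 established.
    return sort_by_column(sort_basic(boxes))


a, b = Box("A", 10, 10), Box("B", 10, 50)
c, d = Box("C", 320, 10), Box("D", 320, 50)

# Two different arrival orders, as might come from set iteration.
order_1 = [t.text for t in two_phase([d, b, c, a])]
order_2 = [t.text for t in two_phase([a, c, d, b])]

# Skipping phase 1 leaves ties in whatever order the input had.
mode_only = [t.text for t in sort_by_column([d, b, c, a])]
```

With the basic sort in place, both arrival orders yield the same result; with the mode sort alone, the output depends on the input order.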
Tesseract special case (pdf.py:976-977): When Tesseract is used as the OCR engine, its output is already sorted in reading order. The sort mode is therefore set to SORT_MODE_DONT to avoid double-sorting, which could scramble Tesseract's correct order.
ListItem inference disabled for layout-detected lists (pdf.py:780-783): When the layout model detects list elements, the NLP-based ListItem inference rules are disabled because they cause "weird chunking" -- the NLP heuristics misidentify list boundaries and produce fragmented chunks.
Table whitespace handling (pdf.py:831-840): Tables receive special whitespace normalization. Newlines are preserved because they carry structural meaning (row boundaries), while horizontal whitespace is collapsed to remove OCR artifacts and alignment padding.
Suppressed re-classification in hi-res (pdf.py:369-372): In the hi-res path, extracted text blocks are not re-classified by NLP rules to avoid misidentification as Title elements. The layout model's classification is trusted over NLP heuristics.
PIL pixel limit (pdf.py:113-114): The PIL maximum pixel limit is raised to 500 million pixels to support rendering PDFs at 300 DPI, which can produce very large images for oversized pages.
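The arithmetic behind the limit can be sketched as follows. PDF page sizes are given in points (1/72 inch), so rendering at 300 DPI scales each dimension by 300/72. The helper name and the sample page sizes are illustrative assumptions, and the default shown for Pillow's `MAX_IMAGE_PIXELS` is the value Pillow defines in its source at the time of writing, so it is worth checking against the installed version.

```python
# Pillow's default MAX_IMAGE_PIXELS (assumed from Pillow's source; verify locally).
PILLOW_DEFAULT_LIMIT = 1024 * 1024 * 1024 // 4 // 3
# The raised limit described above.
RAISED_LIMIT = 500_000_000


def rendered_pixels(width_pt, height_pt, dpi=300):
    """Pixel count of a page rendered at `dpi`, given its size in PDF points."""
    scale = dpi / 72  # points are 1/72 inch
    return round(width_pt * scale) * round(height_pt * scale)


letter = rendered_pixels(612, 792)   # US Letter, 8.5 x 11 in
a0 = rendered_pixels(2384, 3370)     # roughly ISO A0

print(f"Letter: {letter:,} px, A0: {a0:,} px")
print("A0 exceeds Pillow default:", a0 > PILLOW_DEFAULT_LIMIT)
print("A0 fits raised limit:", a0 < RAISED_LIMIT)
```

A Letter page stays far below either limit, while a poster-sized page crosses the default threshold but fits comfortably under 500 million pixels.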
Upstream monkey-patches:
- PDFMiner ICC ColorSpace bug (pdf.py:1043-1053): Patches PDFMiner to handle malformed ICC color profiles that would otherwise cause crashes.
- PDFMiner PSParser bug (pdf.py:105-108): Patches the PostScript parser to handle edge cases in token parsing.
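The general shape of such a patch, and why it is fragile, can be sketched with a stand-in class. Nothing below is PDFMiner's real API: `UpstreamParser`, `parse_token`, and `apply_patch` are hypothetical names chosen for illustration, and the fallback value is arbitrary.

```python
class UpstreamParser:
    """Stand-in for a third-party class; not PDFMiner's real API."""

    def parse_token(self, raw: str) -> int:
        return int(raw)  # raises ValueError on malformed input


def apply_patch(cls=UpstreamParser) -> bool:
    """Wrap cls.parse_token so malformed tokens no longer crash.

    Returns False if the target method is missing (upstream changed)
    or if the patch was already applied.
    """
    original = getattr(cls, "parse_token", None)
    if original is None or getattr(original, "_patched", False):
        return False

    def tolerant_parse_token(self, raw):
        try:
            return original(self, raw)
        except ValueError:
            return 0  # arbitrary fallback for this sketch

    tolerant_parse_token._patched = True
    cls.parse_token = tolerant_parse_token
    return True
```

Calling `apply_patch()` once at import time mirrors the approach described above. The `getattr` guard is the part that addresses upstream drift: if the internal method is renamed or removed, the patch declines to apply instead of failing at import.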
Tunable environment variables:
- UNSTRUCTURED_XY_CUT_BBOX_SHRINK_FACTOR: Controls how much bounding boxes are shrunk before XY-cut sorting to avoid overlap-induced mis-ordering.
- UNSTRUCTURED_XY_CUT_PRIMARY_DIRECTION: Sets whether XY-cut sorts primarily by columns (vertical cuts first) or rows (horizontal cuts first).
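How shrinking helps XY-cut can be sketched as below. This is a minimal illustration, assuming boxes are shrunk about their centers and that the factor is a float read from the environment; `shrink_bbox`, `overlaps_in_x`, `factor_from_env`, and the fallback of 0.9 are assumptions for this sketch, not confirmed library behavior.

```python
import os


def factor_from_env(default=0.9):
    """Read the shrink factor; the default here is assumed for illustration."""
    return float(os.environ.get("UNSTRUCTURED_XY_CUT_BBOX_SHRINK_FACTOR", default))


def shrink_bbox(bbox, factor):
    """Shrink (x1, y1, x2, y2) about its center by `factor` (0 < factor <= 1)."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) / 2 * factor
    half_h = (y2 - y1) / 2 * factor
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def overlaps_in_x(a, b):
    """True if the two boxes share any horizontal extent."""
    return a[0] < b[2] and b[0] < a[2]


# Two columns whose detected boxes overlap by a few pixels.
left = (0, 0, 105, 100)
right = (100, 0, 200, 100)

before = overlaps_in_x(left, right)
after = overlaps_in_x(shrink_bbox(left, 0.9), shrink_bbox(right, 0.9))
print("overlap before shrink:", before)
print("overlap after shrink:", after)
```

Before shrinking, no vertical cut separates the two columns; after shrinking, a gap opens between them, which is what lets a cut-based sort treat them as distinct columns.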
Usage
Apply this heuristic when:
- Debugging incorrect reading order in extracted PDF elements.
- Choosing between Tesseract and other OCR engines and needing to understand sorting implications.
- Processing PDFs with complex layouts (multi-column, mixed tables and text).
- Encountering crashes from malformed PDFs (ICC profiles, PS tokens).
- Rendering large-format PDFs at high DPI.
The Insight (Rule of Thumb)
- Action: Always rely on the two-phase sort (basic first, then mode-specific). Use SORT_MODE_DONT when Tesseract provides pre-sorted output. Disable NLP ListItem inference when the layout model already detects lists. Tune XY-cut via UNSTRUCTURED_XY_CUT_BBOX_SHRINK_FACTOR and UNSTRUCTURED_XY_CUT_PRIMARY_DIRECTION.
- Value: Basic sort ensures deterministic output across Python versions. PIL pixel limit is 500M to support 300 DPI. Newlines preserved in tables; horizontal whitespace collapsed. PDFMiner bugs patched at import time.
- Trade-off: Two-phase sorting adds processing time. Disabling NLP ListItem inference loses some list detection capability when the layout model misses items. Raising PIL pixel limit increases memory usage for large pages. Monkey-patching PDFMiner may break if upstream changes internal APIs.
Reasoning
PDF extraction is inherently messy -- different OCR engines, layout models, and Python runtime behaviors can produce different orderings for the same input. The two-phase sort addresses the non-determinism problem by establishing a stable base before applying more sophisticated algorithms. The Tesseract special case avoids a common pitfall where re-sorting already-correct output produces worse results. The table whitespace and ListItem decisions reflect empirical findings from processing real-world documents: NLP heuristics, while useful in general, produce poor results when applied on top of layout model output that already encodes structural information. The monkey-patches are pragmatic fixes for upstream bugs that would otherwise make certain PDFs unprocessable.
Code Evidence
Two-phase sorting for deterministic order (pdf.py:896-897):
# pdf.py:896-897
# Always do basic sort FIRST for deterministic order across Python versions
elements = sort_elements_basic(elements)
elements = sort_elements_by_mode(elements, sort_mode)
Tesseract pre-sorted output (pdf.py:976-977):
# pdf.py:976-977
# Tesseract returns pre-sorted text; avoid double-sorting
if ocr_agent == "tesseract":
    sort_mode = SORT_MODE_DONT
ListItem inference disabled for layout lists (pdf.py:780-783):
# pdf.py:780-783
# Disable NLP-based ListItem inference for layout-detected lists
# because the NLP rules cause "weird chunking"
if element_was_detected_by_layout_model:
    skip_list_item_inference = True
Table whitespace normalization (pdf.py:831-840):
# pdf.py:831-840
# Tables: preserve newlines (structural), collapse horizontal whitespace
if isinstance(element, Table):
    text = re.sub(r"[^\S\n]+", " ", element.text)  # collapse horiz whitespace
    # newlines preserved -- they indicate row boundaries
    element.text = text.strip()
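The effect of this pattern can be checked on a small sample. The function name and the sample text below are illustrative; only the regular expression is taken from the excerpt above.

```python
import re


def normalize_table_text(text: str) -> str:
    """Collapse runs of non-newline whitespace to one space; keep newlines."""
    # [^\S\n] matches any whitespace character except a newline.
    collapsed = re.sub(r"[^\S\n]+", " ", text)
    return collapsed.strip()


raw = "Name \t  Qty\nApple    3\n  Pear\t\t7  "
clean = normalize_table_text(raw)
print(repr(clean))
```

Row boundaries survive as newlines, while tabs and padding within each row collapse to single spaces. Leading whitespace on an inner row is reduced to one space rather than removed, since `strip()` only trims the ends of the whole string.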
PIL pixel limit raised (pdf.py:113-114):
# pdf.py:113-114
# Raise PIL pixel limit to 500M to support 300 DPI rendering
PIL.Image.MAX_IMAGE_PIXELS = 500_000_000