
Implementation:Unstructured IO Unstructured Golden File Fixtures Local

From Leeroopedia
Knowledge Sources
Domains: Testing, Document_Processing, Ingest
Last Updated: 2026-02-12 09:30 GMT

Overview

These golden-file test fixtures contain the expected JSON element output for the local file processing, PDF table inference, basic chunking, BioMed, and PDF fast reprocess pipelines.

Description

These JSON files represent the expected structured output from processing documents through local and specialized ingest pipelines. They cover scenarios including plain text partitioning, PDF processing with table structure inference, basic chunking, biomedical paper extraction, and PDF fast reprocessing. Each file is a JSON array of element objects conforming to the Unstructured element schema.

The pipelines covered include:

  • Local single file: plain text (UDHR multilingual, 11,350 lines)
  • Local single file with basic chunking: DOCX with chunk boundaries (766 lines)
  • Local single file with PDF table inference: PDF and JPG with table structure (2 files, 4,568 and 311 lines respectively)
  • PDF fast reprocess: reprocessing of Azure PDF output (552 lines)
  • BioMed API: biomedical papers from PubMed Central (2 files, 1,413 and 1,197 lines)
  • BioMed Path: local biomedical paper processing (314 lines)

Usage

These fixtures validate core partitioning logic independent of cloud connectors. They are consumed by local ingest test scripts and the diff-checking utilities.
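As a rough illustration of the comparison a diff check performs, the sketch below compares freshly produced elements against a golden list. The helper name `diff_elements` and the choice to compare only `type` and `text` are assumptions made for this example; the repository's actual diff-checking utilities may work differently.

```python
from typing import Any, Dict, List, Tuple

Element = Dict[str, Any]


def diff_elements(
    expected: List[Element],
    actual: List[Element],
    keys: Tuple[str, ...] = ("type", "text"),
) -> List[str]:
    """Return human-readable mismatches between two element lists.

    Hypothetical helper: compares elements position by position on the
    given keys only, so volatile fields are ignored by default.
    """
    problems: List[str] = []
    if len(expected) != len(actual):
        problems.append(
            f"element count: expected {len(expected)}, got {len(actual)}"
        )
    # zip stops at the shorter list; the count mismatch is reported above.
    for index, (exp, act) in enumerate(zip(expected, actual)):
        for key in keys:
            if exp.get(key) != act.get(key):
                problems.append(
                    f"element {index}: {key!r} differs: "
                    f"{exp.get(key)!r} != {act.get(key)!r}"
                )
    return problems
```

An empty return value means the actual output matches the golden file on the compared keys. Pass a wider `keys` tuple to make the check stricter.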

Code Reference

Source Location

Files Covered

| Pipeline | File | Lines |
|---|---|---|
| Local | local-single-file/UDHR_first_article_all.txt.json | 11350 |
| Local+Chunking | local-single-file-basic-chunking/handbook-1p.docx.json | 766 |
| Local+TableInfer | local-single-file-with-pdf-infer-table-structure/layout-parser-paper-with-table.jpg.json | 311 |
| Local+TableInfer | local-single-file-with-pdf-infer-table-structure/layout-parser-paper.pdf.json | 4568 |
| PDF Reprocess | pdf-fast-reprocess/azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json | 552 |
| BioMed API | biomed-api/65/11/main.PMC6312790.pdf.json | 1413 |
| BioMed API | biomed-api/75/29/main.PMC6312793.pdf.json | 1197 |
| BioMed Path | biomed-path/07/07/sbaa031.073.PMC7234218.pdf.json | 314 |
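The files listed above can be enumerated programmatically, for instance to confirm that a checkout contains every fixture. This is a minimal sketch: the `GOLDEN_FILES` mapping and the `missing_golden_files` helper are hypothetical names introduced here, and the root directory is assumed to be the `expected-structured-output/` folder used elsewhere on this page.

```python
from pathlib import Path
from typing import Dict, List

# Relative paths copied from the table above, grouped by pipeline label.
GOLDEN_FILES: Dict[str, List[str]] = {
    "Local": ["local-single-file/UDHR_first_article_all.txt.json"],
    "Local+Chunking": ["local-single-file-basic-chunking/handbook-1p.docx.json"],
    "Local+TableInfer": [
        "local-single-file-with-pdf-infer-table-structure/layout-parser-paper-with-table.jpg.json",
        "local-single-file-with-pdf-infer-table-structure/layout-parser-paper.pdf.json",
    ],
    "PDF Reprocess": [
        "pdf-fast-reprocess/azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json"
    ],
    "BioMed API": [
        "biomed-api/65/11/main.PMC6312790.pdf.json",
        "biomed-api/75/29/main.PMC6312793.pdf.json",
    ],
    "BioMed Path": ["biomed-path/07/07/sbaa031.073.PMC7234218.pdf.json"],
}


def missing_golden_files(root: Path) -> List[str]:
    """Return the relative paths that do not exist as files under root."""
    return [
        rel
        for paths in GOLDEN_FILES.values()
        for rel in paths
        if not (root / rel).is_file()
    ]
```

Calling `missing_golden_files(Path("test_unstructured_ingest/expected-structured-output"))` from the repository root should return an empty list when all eight fixtures are present.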

Signature

[
  {
    "type": "NarrativeText",
    "element_id": "hex_string",
    "text": "Parsed document content",
    "metadata": {
      "languages": ["eng"],
      "filetype": "application/pdf",
      "page_number": 1,
      "coordinates": {
        "points": [[x1,y1], [x2,y2], [x3,y3], [x4,y4]],
        "system": "PixelSpace",
        "layout_width": 612,
        "layout_height": 792
      }
    }
  }
]
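The shape shown above can be checked mechanically before a fixture is compared against new output. The sketch below is illustrative only: `check_element_shape` is a hypothetical helper, the required keys are taken from the signature on this page, and the four-point rule for `coordinates` is an assumption based on that same example rather than a statement of the full Unstructured schema.

```python
from typing import Any, List

# Keys that appear on every element in the signature above.
REQUIRED_KEYS = ("type", "element_id", "text", "metadata")


def check_element_shape(elements: Any) -> List[str]:
    """Return a list of shape problems; an empty list means none were found."""
    if not isinstance(elements, list):
        return ["top-level value is not a JSON array"]
    errors: List[str] = []
    for index, element in enumerate(elements):
        if not isinstance(element, dict):
            errors.append(f"element {index}: not an object")
            continue
        for key in REQUIRED_KEYS:
            if key not in element:
                errors.append(f"element {index}: missing {key!r}")
        metadata = element.get("metadata")
        if isinstance(metadata, dict):
            coords = metadata.get("coordinates")
            # Coordinates are optional; when present, assume four corner points.
            if isinstance(coords, dict) and len(coords.get("points", [])) != 4:
                errors.append(f"element {index}: expected 4 coordinate points")
    return errors
```

Returning a list of messages instead of raising on the first problem makes it easier to report every malformed element in a large fixture at once.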

Import

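# Golden files are data, not an importable module; test scripts load them as JSON.
import json

# Explicit UTF-8 avoids platform-default decoding problems with multilingual text.
with open("test_unstructured_ingest/expected-structured-output/local-single-file/UDHR_first_article_all.txt.json", encoding="utf-8") as f:
    expected_elements = json.load(f)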

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| JSON file path | str | Yes | Path to a golden file under expected-structured-output/ |

Outputs

| Name | Type | Description |
|---|---|---|
| elements | List[Dict] | JSON array of element dicts with type, element_id, text, and metadata |
| metadata.coordinates | Dict | Bounding box coordinates for PDF/image elements (when applicable) |
| metadata.page_number | int | Page number within multi-page documents |
| metadata.filetype | str | MIME type of the original document |
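Because some metadata fields are present only when applicable, code that reads these outputs should tolerate missing keys. The following sketch groups loaded elements by `metadata.page_number`. The helper name `group_by_page` is hypothetical, and treating a missing page number as `None` is an assumption made for this example.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional

Element = Dict[str, Any]


def group_by_page(elements: List[Element]) -> Dict[Optional[int], List[Element]]:
    """Group elements by metadata.page_number, preserving document order.

    Elements without a page number are collected under the key None.
    """
    pages: Dict[Optional[int], List[Element]] = defaultdict(list)
    for element in elements:
        page = element.get("metadata", {}).get("page_number")
        pages[page].append(element)
    return dict(pages)
```

For a multi-page PDF fixture, `group_by_page(elements)[1]` would hold the elements from the first page in their original order.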

Usage Examples

Validating Table Structure Inference Output

import json
from pathlib import Path

# Load the golden file for the PDF processed with table structure inference
root = Path("test_unstructured_ingest/expected-structured-output")
path = root / "local-single-file-with-pdf-infer-table-structure" / "layout-parser-paper.pdf.json"
with open(path, encoding="utf-8") as f:
    elements = json.load(f)

# Collect the Table elements
tables = [e for e in elements if e["type"] == "Table"]
print(f"Found {len(tables)} table elements")

# Report tables that carry an HTML representation in their metadata
for table in tables:
    html = table.get("metadata", {}).get("text_as_html")
    if html is not None:
        print(f"Table has HTML: {html[:100]}...")
