Implementation:Unstructured IO Unstructured Golden File Fixtures Cloud Storage

From Leeroopedia
Knowledge Sources
Domains Testing, Cloud_Storage, Ingest
Last Updated 2026-02-12 09:30 GMT

Overview

Golden file test fixtures containing expected JSON element output for cloud storage ingest connectors (Azure, S3, SharePoint, Dropbox, Box, OneDrive, Google Drive).

Description

These JSON files hold the expected structured output from processing documents through cloud storage ingest connectors. Each file is a JSON array of element objects conforming to the Unstructured element schema. They serve as regression baselines: CI runs the ingest pipeline and diffs the actual output against these golden files to detect unintended changes in parsing behavior.
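
The regression check described above can be pictured as a text diff over deterministically serialized JSON. The following is a minimal Python sketch of that idea, not the project's actual CI code; the function names `canonical_lines` and `diff_against_golden` and the in-memory sample elements are hypothetical.

```python
import difflib
import json


def canonical_lines(elements):
    """Serialize elements deterministically so a text diff is stable."""
    return json.dumps(elements, indent=2, sort_keys=True).splitlines()


def diff_against_golden(expected, actual):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(
        difflib.unified_diff(
            canonical_lines(expected),
            canonical_lines(actual),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )


# Tiny in-memory stand-ins for a golden file and a fresh pipeline run.
golden = [{"type": "Title", "element_id": "abc", "text": "Hello"}]
fresh = [{"type": "Title", "element_id": "abc", "text": "Hello!"}]

for line in diff_against_golden(golden, fresh):
    print(line)
```

Sorting keys before diffing keeps the comparison insensitive to key order, so only real content changes show up as `-`/`+` lines.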

The cloud storage connectors covered include:

  • Azure Blob Storage — PDF, PNG, TXT, and HTML documents (5 files)
  • Amazon S3 — PDF documents including economic reports and forms (4 files)
  • SharePoint — PDF documents in nested folder structures (2 files)
  • SharePoint with Permissions — Same documents with permission metadata (2 files)
  • Dropbox — DOCX and PPTX documents (2 files)
  • Box — DOCX and PPTX documents (2 files)
  • OneDrive — XLS spreadsheet (1 file)
  • Google Drive — PDF report (1 file)

Usage

These fixtures are consumed by the ingest test scripts (e.g., `test_unstructured_ingest/src/azure.sh`, `test_unstructured_ingest/src/s3.sh`) and the diff-checking utilities (`check-diff-expected-output.sh`). When parsing behavior changes intentionally, regenerate them by running the relevant test script with `OVERWRITE_FIXTURES=true`.
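
One way to picture the overwrite switch is as a branch between "compare" and "rewrite". The sketch below is a hypothetical Python rendering of that behavior, not the logic of `check-diff-expected-output.sh`; `check_or_overwrite`, its return values, and the temporary demo path are made up for illustration, and only the `OVERWRITE_FIXTURES` variable name comes from the text above.

```python
import json
import os
import tempfile
from pathlib import Path


def check_or_overwrite(actual, golden_path, env=os.environ):
    """Compare `actual` with the golden file, or rewrite the file when
    OVERWRITE_FIXTURES is "true".

    Returns "match", "mismatch", or "overwritten" (hypothetical labels).
    """
    golden_path = Path(golden_path)
    if env.get("OVERWRITE_FIXTURES", "false").lower() == "true":
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(json.dumps(actual, indent=2), encoding="utf-8")
        return "overwritten"
    expected = json.loads(golden_path.read_text(encoding="utf-8"))
    return "match" if expected == actual else "mismatch"


# Demo against a throwaway directory rather than the real fixture tree.
demo_path = Path(tempfile.mkdtemp()) / "azure" / "example.json"
print(check_or_overwrite([{"type": "Title"}], demo_path, env={"OVERWRITE_FIXTURES": "true"}))  # overwritten
print(check_or_overwrite([{"type": "Title"}], demo_path, env={}))  # match
print(check_or_overwrite([{"type": "Text"}], demo_path, env={}))   # mismatch
```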

Code Reference

Source Location

Files Covered

Connector         File                                                                   Lines
Azure             azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json      600
Azure             azure/IRS-form-1987.pdf.json                                           1762
Azure             azure/IRS-form-1987.png.json                                           904
Azure             azure/rfc854.txt.json                                                  2942
Azure             azure/spring-weather.html.json                                         2606
S3                s3/2023-Jan-economic-outlook.pdf.json                                  3677
S3                s3/Silent-Giant-(1).pdf.json                                           2884
S3                s3/page-with-formula.pdf.json                                          414
S3                s3/recalibrating-risk-report.pdf.json                                  2312
SharePoint        Sharepoint/nested/2023-Jan-economic-outlook.pdf.json                   5246
SharePoint        Sharepoint/nested/page-with-formula.pdf.json                           784
SharePoint+Perms  Sharepoint-with-permissions/nested/2023-Jan-economic-outlook.pdf.json  5246
SharePoint+Perms  Sharepoint-with-permissions/nested/page-with-formula.pdf.json          784
Dropbox           dropbox/handbook-1p.docx.json                                          346
Dropbox           dropbox/science-exploration-1p.pptx.json                               301
Box               box/handbook-1p.docx.json                                              346
Box               box/science-exploration-1p.pptx.json                                   301
OneDrive          onedrive/utic-test-ingest-fixtures/tests-example.xls.json              340
Google Drive      google-drive/recalibrating-risk-report.pdf.json                        9860
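
As the table suggests, each connector's fixtures sit under their own top-level directory. A small sketch, using only the relative paths listed above, that tallies files per directory; `FIXTURE_PATHS` and `files_per_dir` are names introduced here for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath

# Relative fixture paths copied from the "Files Covered" table.
FIXTURE_PATHS = [
    "azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json",
    "azure/IRS-form-1987.pdf.json",
    "azure/IRS-form-1987.png.json",
    "azure/rfc854.txt.json",
    "azure/spring-weather.html.json",
    "s3/2023-Jan-economic-outlook.pdf.json",
    "s3/Silent-Giant-(1).pdf.json",
    "s3/page-with-formula.pdf.json",
    "s3/recalibrating-risk-report.pdf.json",
    "Sharepoint/nested/2023-Jan-economic-outlook.pdf.json",
    "Sharepoint/nested/page-with-formula.pdf.json",
    "Sharepoint-with-permissions/nested/2023-Jan-economic-outlook.pdf.json",
    "Sharepoint-with-permissions/nested/page-with-formula.pdf.json",
    "dropbox/handbook-1p.docx.json",
    "dropbox/science-exploration-1p.pptx.json",
    "box/handbook-1p.docx.json",
    "box/science-exploration-1p.pptx.json",
    "onedrive/utic-test-ingest-fixtures/tests-example.xls.json",
    "google-drive/recalibrating-risk-report.pdf.json",
]

# The first path component identifies the connector directory.
files_per_dir = Counter(PurePosixPath(p).parts[0] for p in FIXTURE_PATHS)

for directory, count in sorted(files_per_dir.items()):
    print(f"{directory}: {count}")
```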

Signature

[
  {
    "type": "Title",
    "element_id": "hex_string",
    "text": "Element text content",
    "metadata": {
      "languages": ["eng"],
      "filetype": "application/pdf",
      "data_source": {
        "url": "connector_specific_url",
        "version": "version_string",
        "record_locator": {
          "protocol": "s3|abfs|...",
          "remote_file_path": "path"
        },
        "date_created": "unix_timestamp",
        "date_modified": "unix_timestamp"
      }
    }
  }
]
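
The template above can be turned into a light structural check. This is a sketch under the assumption that the four top-level keys shown are always present; it is not the official Unstructured schema, and `REQUIRED_KEYS` and `shape_problems` are hypothetical names.

```python
# Top-level keys taken from the template above (assumed to be mandatory).
REQUIRED_KEYS = {"type", "element_id", "text", "metadata"}


def shape_problems(element):
    """Return a list of problems found; an empty list means the shape looks right."""
    problems = []
    missing = REQUIRED_KEYS - set(element)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    metadata = element.get("metadata", {})
    if not isinstance(metadata, dict):
        problems.append("metadata is not an object")
        return problems
    languages = metadata.get("languages", [])
    if not (isinstance(languages, list) and all(isinstance(x, str) for x in languages)):
        problems.append("metadata.languages is not a list of strings")
    if not isinstance(metadata.get("data_source", {}), dict):
        problems.append("metadata.data_source is not an object")
    return problems


sample = {
    "type": "Title",
    "element_id": "hex_string",
    "text": "Element text content",
    "metadata": {"languages": ["eng"], "filetype": "application/pdf"},
}
print(shape_problems(sample))             # []
print(shape_problems({"type": "Title"}))  # ["missing keys: ['element_id', 'metadata', 'text']"]
```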

Import

# Not importable — consumed by test scripts
# To load a golden file programmatically:
import json
with open("test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.pdf.json") as f:
    expected_elements = json.load(f)

I/O Contract

Inputs

Name            Type  Required  Description
JSON file path  str   Yes       Path to a golden file under expected-structured-output/

Outputs

Name                  Type        Description
elements              List[Dict]  JSON array of element dicts with type, element_id, text, and metadata
metadata.data_source  Dict        Connector-specific provenance (URL, protocol, timestamps)
metadata.filetype     str         MIME type of the original document
metadata.languages    List[str]   Detected languages (ISO 639 codes)
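
The output fields listed above are nested, so reading them defensively avoids a `KeyError` when a fixture omits one. A minimal sketch, assuming the nesting shown in the Signature section; `provenance` is a hypothetical helper and the sample element is invented.

```python
def provenance(element):
    """Collect the documented metadata fields from one element.

    Missing fields come back as None (or an empty list for languages)
    instead of raising KeyError.
    """
    metadata = element.get("metadata") or {}
    data_source = metadata.get("data_source") or {}
    record_locator = data_source.get("record_locator") or {}
    return {
        "type": element.get("type"),
        "filetype": metadata.get("filetype"),
        "languages": metadata.get("languages") or [],
        "url": data_source.get("url"),
        "protocol": record_locator.get("protocol"),
    }


# Invented element following the template's nesting.
example = {
    "type": "Title",
    "element_id": "hex_string",
    "text": "Element text content",
    "metadata": {
        "languages": ["eng"],
        "filetype": "application/pdf",
        "data_source": {
            "url": "connector_specific_url",
            "record_locator": {"protocol": "s3", "remote_file_path": "path"},
        },
    },
}
print(provenance(example))
print(provenance({"type": "Title"}))  # every metadata-derived field falls back to None or []
```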

Usage Examples

Loading and Inspecting a Golden File

import json
from collections import Counter

# Load expected output for Azure PDF processing
with open("test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.pdf.json", encoding="utf-8") as f:
    elements = json.load(f)

# Count how many elements of each type the fixture contains
type_counts = Counter(e["type"] for e in elements)
print(type_counts)
# Illustrative output shape only; actual counts depend on the fixture:
# Counter({'NarrativeText': 45, 'Title': 12, 'ListItem': 8, ...})

# Check a specific element
first = elements[0]
print(f"Type: {first['type']}, Text: {first['text'][:80]}...")
print(f"Source: {first['metadata']['data_source']['url']}")

Updating Fixtures After Intentional Changes

# Re-run ingest tests with fixture overwrite enabled
OVERWRITE_FIXTURES=true ./test_unstructured_ingest/src/azure.sh

# Review and commit updated fixtures
git diff test_unstructured_ingest/expected-structured-output/azure/
git add test_unstructured_ingest/expected-structured-output/azure/
git commit -m "Update Azure golden files after parsing improvement"
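
Before committing regenerated fixtures, it can help to see which element types gained or lost entries, as a complement to reading the raw `git diff`. A minimal sketch, assuming the old and new fixture contents are already loaded as lists of dicts; `type_count_changes` is a hypothetical helper and the sample data is invented.

```python
from collections import Counter


def type_count_changes(old_elements, new_elements):
    """Map element type -> (old count, new count) for types whose count changed."""
    old = Counter(e["type"] for e in old_elements)
    new = Counter(e["type"] for e in new_elements)
    return {
        t: (old[t], new[t])
        for t in sorted(old.keys() | new.keys())
        if old[t] != new[t]
    }


# Invented before/after contents standing in for one fixture file.
before = [{"type": "Title"}, {"type": "NarrativeText"}, {"type": "NarrativeText"}]
after = [{"type": "Title"}, {"type": "NarrativeText"}, {"type": "ListItem"}]

print(type_count_changes(before, after))
# {'ListItem': (0, 1), 'NarrativeText': (2, 1)}
```

An empty result does not prove the fixtures are identical, since text or metadata can change while type counts stay the same; it only flags shifts in element classification.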
