Implementation:Unstructured IO Unstructured Golden File Fixtures Cloud Storage
| Knowledge Sources | |
|---|---|
| Domains | Testing, Cloud_Storage, Ingest |
| Last Updated | 2026-02-12 09:30 GMT |
Overview
Golden file test fixtures containing expected JSON element output for cloud storage ingest connectors (Azure, S3, SharePoint, Dropbox, Box, OneDrive, Google Drive).
Description
These JSON files hold the expected structured output from processing documents through cloud storage ingest connectors. Each file is a JSON array of element objects conforming to the Unstructured element schema. They serve as regression baselines: CI runs the ingest pipeline and diffs the actual output against these golden files to detect unintended changes in parsing behavior.
The cloud storage connectors covered include:
- Azure Blob Storage — PDF, PNG, TXT, and HTML documents (5 files)
- Amazon S3 — PDF documents including economic reports and forms (4 files)
- SharePoint — PDF documents in nested folder structures (2 files)
- SharePoint with Permissions — Same documents with permission metadata (2 files)
- Dropbox — DOCX and PPTX documents (2 files)
- Box — DOCX and PPTX documents (2 files)
- OneDrive — XLS spreadsheet (1 file)
- Google Drive — PDF report (1 file)
Usage
These fixtures are consumed by the ingest test scripts (e.g., `test_unstructured_ingest/src/azure.sh`, `test_unstructured_ingest/src/s3.sh`) and by the diff-checking utility `check-diff-expected-output.sh`. When parsing behavior changes intentionally, regenerate them by rerunning the relevant test script with `OVERWRITE_FIXTURES=true`, then review the resulting diff before committing.
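The comparison step can be illustrated in Python. This is a minimal sketch of a golden-file diff, not the logic of `check-diff-expected-output.sh`; the function name `diff_against_golden` and the directory layout used in the demo are assumptions made for illustration. It compares parsed JSON rather than raw text, so formatting and key order do not count as changes.

```python
import json
import tempfile
from pathlib import Path


def diff_against_golden(expected_dir: Path, actual_dir: Path) -> dict:
    """Compare each golden *.json file with its counterpart in actual_dir.

    Hypothetical helper: returns relative paths grouped as
    'missing' (golden file has no actual output), 'changed' (content differs),
    and 'unexpected' (actual output has no golden file).
    """
    expected_files = {p.relative_to(expected_dir) for p in expected_dir.rglob("*.json")}
    actual_files = {p.relative_to(actual_dir) for p in actual_dir.rglob("*.json")}
    report = {"missing": [], "changed": [], "unexpected": []}

    for rel in sorted(expected_files):
        if rel not in actual_files:
            report["missing"].append(rel.as_posix())
            continue
        expected = json.loads((expected_dir / rel).read_text(encoding="utf-8"))
        actual = json.loads((actual_dir / rel).read_text(encoding="utf-8"))
        # Parsed values are compared, so whitespace and key order are ignored.
        if expected != actual:
            report["changed"].append(rel.as_posix())

    report["unexpected"] = sorted(p.as_posix() for p in actual_files - expected_files)
    return report


if __name__ == "__main__":
    # Self-contained demo on throwaway directories (no repository files needed).
    with tempfile.TemporaryDirectory() as tmp:
        expected_root = Path(tmp) / "expected"
        actual_root = Path(tmp) / "actual"
        (expected_root / "azure").mkdir(parents=True)
        (actual_root / "azure").mkdir(parents=True)
        (expected_root / "azure" / "doc.json").write_text('[{"type": "Title", "text": "A"}]')
        (actual_root / "azure" / "doc.json").write_text('[{"type": "Title", "text": "B"}]')
        print(diff_against_golden(expected_root, actual_root))
```

A non-empty `changed` or `missing` list would correspond to a failing regression check; whether extra output files should also fail the check is a policy choice this sketch leaves to the caller.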
Code Reference
Source Location
- Repository: Unstructured_IO_Unstructured
- File: test_unstructured_ingest/expected-structured-output/ (multiple subdirectories)
Files Covered
| Connector | File | Lines |
|---|---|---|
| Azure | azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json | 600 |
| Azure | azure/IRS-form-1987.pdf.json | 1762 |
| Azure | azure/IRS-form-1987.png.json | 904 |
| Azure | azure/rfc854.txt.json | 2942 |
| Azure | azure/spring-weather.html.json | 2606 |
| S3 | s3/2023-Jan-economic-outlook.pdf.json | 3677 |
| S3 | s3/Silent-Giant-(1).pdf.json | 2884 |
| S3 | s3/page-with-formula.pdf.json | 414 |
| S3 | s3/recalibrating-risk-report.pdf.json | 2312 |
| SharePoint | Sharepoint/nested/2023-Jan-economic-outlook.pdf.json | 5246 |
| SharePoint | Sharepoint/nested/page-with-formula.pdf.json | 784 |
| SharePoint+Perms | Sharepoint-with-permissions/nested/2023-Jan-economic-outlook.pdf.json | 5246 |
| SharePoint+Perms | Sharepoint-with-permissions/nested/page-with-formula.pdf.json | 784 |
| Dropbox | dropbox/handbook-1p.docx.json | 346 |
| Dropbox | dropbox/science-exploration-1p.pptx.json | 301 |
| Box | box/handbook-1p.docx.json | 346 |
| Box | box/science-exploration-1p.pptx.json | 301 |
| OneDrive | onedrive/utic-test-ingest-fixtures/tests-example.xls.json | 340 |
| Google Drive | google-drive/recalibrating-risk-report.pdf.json | 9860 |
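An inventory like the table above could be regenerated with a short script. The following is a sketch only: the helper name `inventory` is hypothetical, and it assumes that the first path component under the fixture root identifies the connector, as in the paths listed above.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def inventory(root: Path) -> dict:
    """Group *.json fixtures by top-level directory, with a line count per file."""
    by_connector = defaultdict(list)
    for path in sorted(root.rglob("*.json")):
        rel = path.relative_to(root)
        with path.open(encoding="utf-8") as f:
            line_count = sum(1 for _ in f)  # count lines without loading the whole file
        by_connector[rel.parts[0]].append((rel.as_posix(), line_count))
    return dict(by_connector)


if __name__ == "__main__":
    # Demo on a throwaway tree; point `root` at the real fixture directory instead.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "box").mkdir()
        (root / "box" / "example.docx.json").write_text("[\n  {}\n]\n")
        for connector, files in inventory(root).items():
            for name, lines in files:
                print(f"| {connector} | {name} | {lines} |")
```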
Signature
[
  {
    "type": "Title",
    "element_id": "hex_string",
    "text": "Element text content",
    "metadata": {
      "languages": ["eng"],
      "filetype": "application/pdf",
      "data_source": {
        "url": "connector_specific_url",
        "version": "version_string",
        "record_locator": {
          "protocol": "s3|abfs|...",
          "remote_file_path": "path"
        },
        "date_created": "unix_timestamp",
        "date_modified": "unix_timestamp"
      }
    }
  }
]
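A quick structural check against this shape can catch malformed fixtures before they are committed. The sketch below assumes that the four top-level keys shown in the signature (`type`, `element_id`, `text`, `metadata`) are required on every element; that requirement and the helper name `find_schema_problems` are assumptions for illustration, not a published schema validator.

```python
# Assumed required keys, taken from the signature above.
REQUIRED_KEYS = {"type", "element_id", "text", "metadata"}


def find_schema_problems(elements) -> list:
    """Return human-readable problems; an empty list means the shape looks right."""
    if not isinstance(elements, list):
        return ["top-level JSON value is not an array"]
    problems = []
    for i, element in enumerate(elements):
        if not isinstance(element, dict):
            problems.append(f"element {i}: not an object")
            continue
        missing = REQUIRED_KEYS - set(element)
        if missing:
            problems.append(f"element {i}: missing {sorted(missing)}")
        if not isinstance(element.get("metadata", {}), dict):
            problems.append(f"element {i}: metadata is not an object")
    return problems


if __name__ == "__main__":
    sample = [
        {"type": "Title", "element_id": "abc123", "text": "Hello", "metadata": {}},
        {"type": "Title"},
    ]
    print(find_schema_problems(sample))
    # ["element 1: missing ['element_id', 'metadata', 'text']"]
```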
Import
# Not importable; consumed by test scripts
# To load a golden file programmatically:
import json

with open("test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.pdf.json") as f:
    expected_elements = json.load(f)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| JSON file path | str | Yes | Path to a golden file under expected-structured-output/ |
Outputs
| Name | Type | Description |
|---|---|---|
| elements | List[Dict] | JSON array of element dicts with type, element_id, text, and metadata |
| metadata.data_source | Dict | Connector-specific provenance (URL, protocol, timestamps) |
| metadata.filetype | str | MIME type of the original document |
| metadata.languages | List[str] | Detected languages (ISO 639-3 codes, e.g. `eng`) |
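The metadata fields in the table above can be tallied across a fixture to confirm that every element carries the expected provenance. This is a sketch under the assumption that the fields are nested as shown in the signature; the helper name `summarize_metadata` and the `<missing>` placeholder are illustrative choices.

```python
from collections import Counter


def summarize_metadata(elements: list) -> dict:
    """Count filetypes, languages, and record_locator protocols across elements."""
    filetypes, languages, protocols = Counter(), Counter(), Counter()
    for element in elements:
        metadata = element.get("metadata", {})
        filetypes[metadata.get("filetype", "<missing>")] += 1
        for lang in metadata.get("languages", []):
            languages[lang] += 1
        # Provenance is nested: metadata -> data_source -> record_locator -> protocol
        locator = metadata.get("data_source", {}).get("record_locator", {})
        protocols[locator.get("protocol", "<missing>")] += 1
    return {
        "filetypes": dict(filetypes),
        "languages": dict(languages),
        "protocols": dict(protocols),
    }


if __name__ == "__main__":
    sample = [
        {
            "type": "Title",
            "metadata": {
                "languages": ["eng"],
                "filetype": "application/pdf",
                "data_source": {"record_locator": {"protocol": "s3"}},
            },
        },
        {"type": "NarrativeText", "metadata": {"filetype": "application/pdf"}},
    ]
    print(summarize_metadata(sample))
```

Within a single golden file, more than one filetype or protocol would be unexpected, since each file corresponds to one source document fetched through one connector.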
Usage Examples
Loading and Inspecting a Golden File
import json
from collections import Counter

# Load expected output for Azure PDF processing
with open("test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.pdf.json") as f:
    elements = json.load(f)

# Inspect element types
type_counts = Counter(e["type"] for e in elements)
print(type_counts)
# Illustrative output shape; actual counts depend on the fixture:
# Counter({'NarrativeText': 45, 'Title': 12, 'ListItem': 8, ...})

# Check a specific element
first = elements[0]
print(f"Type: {first['type']}, Text: {first['text'][:80]}...")
print(f"Source: {first['metadata']['data_source']['url']}")
Updating Fixtures After Intentional Changes
# Re-run ingest tests with fixture overwrite enabled
OVERWRITE_FIXTURES=true ./test_unstructured_ingest/src/azure.sh
# Review and commit updated fixtures
git diff test_unstructured_ingest/expected-structured-output/azure/
git add test_unstructured_ingest/expected-structured-output/azure/
git commit -m "Update Azure golden files after parsing improvement"