Implementation:Unstructured IO Unstructured Golden File Fixtures Local
| Knowledge Sources | |
|---|---|
| Domains | Testing, Document_Processing, Ingest |
| Last Updated | 2026-02-12 09:30 GMT |
Overview
Golden file test fixtures containing expected JSON element output for local file processing, PDF table inference, basic chunking, BioMed, and PDF fast reprocess pipelines.
Description
These JSON files represent the expected structured output from processing documents through local and specialized ingest pipelines. They cover scenarios including plain text partitioning, PDF processing with table structure inference, basic chunking, biomedical paper extraction, and PDF fast reprocessing. Each file is a JSON array of element objects conforming to the Unstructured element schema.
The pipelines covered include:
- Local single file: plain text (UDHR multilingual, 11,350 lines)
- Local single file with basic chunking: DOCX with chunk boundaries (766 lines)
- Local single file with PDF table inference: PDF and JPG with table structure (2 files, 311 and 4,568 lines)
- PDF fast reprocess: reprocessing of Azure PDF output (552 lines)
- BioMed API: biomedical papers from PubMed Central (2 files, 1,197 and 1,413 lines)
- BioMed Path: local biomedical paper processing (314 lines)
Usage
These fixtures validate core partitioning logic independent of cloud connectors. They are consumed by local ingest test scripts and the diff-checking utilities.
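As a rough illustration of what a golden-file comparison involves, the sketch below compares freshly produced elements against expected ones on the `type` and `text` fields. The helper `diff_elements` and the inline sample data are hypothetical; this is not the repository's diff-checking utility, and which fields a real check compares is an assumption here.

```python
import json
from typing import Any, Dict, List


def diff_elements(expected: List[Dict[str, Any]], actual: List[Dict[str, Any]]) -> List[str]:
    """Hypothetical helper: describe mismatches between two element lists."""
    problems = []
    if len(expected) != len(actual):
        problems.append(f"element count: expected {len(expected)}, got {len(actual)}")
    # Compare pairwise up to the shorter list so a length mismatch
    # does not hide differences in the elements that do line up.
    for i, (exp, act) in enumerate(zip(expected, actual)):
        for key in ("type", "text"):  # assumed comparison fields
            if exp.get(key) != act.get(key):
                problems.append(f"element {i}: {key} differs")
    return problems


# Inline stand-ins for a golden file and a fresh pipeline run.
golden = json.loads('[{"type": "Title", "text": "A"}, {"type": "NarrativeText", "text": "B"}]')
produced = json.loads('[{"type": "Title", "text": "A"}, {"type": "NarrativeText", "text": "C"}]')
print(diff_elements(golden, produced))  # ['element 1: text differs']
```

In practice `golden` would come from `json.load` on one of the fixture files and `produced` from the pipeline under test.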
Code Reference
Source Location
- Repository: Unstructured_IO_Unstructured
- Directory: test_unstructured_ingest/expected-structured-output/ (multiple subdirectories)
Files Covered
| Pipeline | File | Lines |
|---|---|---|
| Local | local-single-file/UDHR_first_article_all.txt.json | 11350 |
| Local+Chunking | local-single-file-basic-chunking/handbook-1p.docx.json | 766 |
| Local+TableInfer | local-single-file-with-pdf-infer-table-structure/layout-parser-paper-with-table.jpg.json | 311 |
| Local+TableInfer | local-single-file-with-pdf-infer-table-structure/layout-parser-paper.pdf.json | 4568 |
| PDF Reprocess | pdf-fast-reprocess/azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json | 552 |
| BioMed API | biomed-api/65/11/main.PMC6312790.pdf.json | 1413 |
| BioMed API | biomed-api/75/29/main.PMC6312793.pdf.json | 1197 |
| BioMed Path | biomed-path/07/07/sbaa031.073.PMC7234218.pdf.json | 314 |
Signature
[
  {
    "type": "NarrativeText",
    "element_id": "hex_string",
    "text": "Parsed document content",
    "metadata": {
      "languages": ["eng"],
      "filetype": "application/pdf",
      "page_number": 1,
      "coordinates": {
        "points": [[x1, y1], [x2, y2], [x3, y3], [x4, y4]],
        "system": "PixelSpace",
        "layout_width": 612,
        "layout_height": 792
      }
    }
  }
]
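The shape above can be checked mechanically. The following is a minimal sketch, assuming that all four top-level keys are mandatory and that `coordinates`, when present, carries four `[x, y]` corner points as in the signature. The helper `schema_problems` and the sample element are hypothetical and are not part of the Unstructured library.

```python
from typing import Any, Dict, List

# Top-level keys taken from the signature above; treating all four as
# mandatory is an assumption of this sketch.
REQUIRED_KEYS = ("type", "element_id", "text", "metadata")


def schema_problems(element: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: list ways an element departs from the signature."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in element]
    coordinates = element.get("metadata", {}).get("coordinates")
    # Coordinates are optional; when present, expect four [x, y] corner points.
    if coordinates is not None:
        points = coordinates.get("points", [])
        if len(points) != 4 or any(len(p) != 2 for p in points):
            problems.append("coordinates.points is not four [x, y] pairs")
    return problems


# Made-up element with concrete numbers in place of the x/y placeholders.
sample = {
    "type": "NarrativeText",
    "element_id": "0123abcd",
    "text": "Parsed document content",
    "metadata": {
        "filetype": "application/pdf",
        "page_number": 1,
        "coordinates": {
            "points": [[0, 0], [0, 10], [10, 10], [10, 0]],
            "system": "PixelSpace",
            "layout_width": 612,
            "layout_height": 792,
        },
    },
}
print(schema_problems(sample))  # []
```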
Import
# Not importable as a module; test scripts load the JSON directly
import json

with open("test_unstructured_ingest/expected-structured-output/local-single-file/UDHR_first_article_all.txt.json") as f:
    expected_elements = json.load(f)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| JSON file path | str | Yes | Path to a golden file under expected-structured-output/ |
Outputs
| Name | Type | Description |
|---|---|---|
| elements | List[Dict] | JSON array of element dicts with type, element_id, text, and metadata |
| metadata.coordinates | Dict | Bounding box coordinates for PDF/image elements (when applicable) |
| metadata.page_number | int | Page number within multi-page documents |
| metadata.filetype | str | MIME type of the original document |
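To get a quick feel for what a loaded fixture contains, the output fields above can be tallied. This is a small sketch under stated assumptions: `summarize` is a hypothetical helper, the sample elements are made up, and elements without `metadata.page_number` (which may be the case for plain-text fixtures) are counted under `None`.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def summarize(elements: List[Dict[str, Any]]) -> Tuple[Counter, Counter]:
    """Hypothetical helper: count elements per type and per page."""
    types = Counter(e["type"] for e in elements)
    # Elements lacking a page number are grouped under the key None.
    pages = Counter(e.get("metadata", {}).get("page_number") for e in elements)
    return types, pages


# Made-up elements standing in for json.load() on a golden file.
sample_elements = [
    {"type": "Title", "text": "Heading", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Body", "metadata": {"page_number": 1}},
    {"type": "Table", "text": "Cells", "metadata": {"page_number": 2}},
]
type_counts, page_counts = summarize(sample_elements)
print(dict(type_counts))
print(dict(page_counts))
```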
Usage Examples
Validating Table Structure Inference Output
import json

# Load golden file for PDF with table structure inference
path = "test_unstructured_ingest/expected-structured-output/"
path += "local-single-file-with-pdf-infer-table-structure/layout-parser-paper.pdf.json"
with open(path) as f:
    elements = json.load(f)

# Find Table elements
tables = [e for e in elements if e["type"] == "Table"]
print(f"Found {len(tables)} table elements")

# Check that table elements have HTML representation
for table in tables:
    if "text_as_html" in table.get("metadata", {}):
        print(f"Table has HTML: {table['metadata']['text_as_html'][:100]}...")