Implementation:Unstructured IO Unstructured Golden File Fixtures Collaboration
| Knowledge Sources | |
|---|---|
| Domains | Testing, Collaboration, Ingest |
| Last Updated | 2026-02-12 09:30 GMT |
Overview
Golden file test fixtures containing expected JSON element output for collaboration platform ingest connectors (Confluence, Jira, Notion, Salesforce).
Description
These JSON files represent the expected structured output from processing documents through collaboration and CRM platform ingest connectors. Each file is a JSON array of element objects conforming to the Unstructured element schema. They serve as regression baselines: the CI pipeline diffs actual connector output against these golden files, so an unintended change in parsing behavior surfaces as a diff.
The collaboration connectors covered include:
- Confluence — Wiki pages from multiple spaces (6 files across MFS and testteamsp spaces)
- Jira — Issue content from project boards (2 files)
- Notion — Database and page content (2 files)
- Salesforce — Campaign XML records (4 files)
Usage
These fixtures are consumed by the ingest test scripts (e.g., `test_unstructured_ingest/src/confluence-diff.sh`, `test_unstructured_ingest/src/salesforce.sh`) and the diff-checking utilities. Update them when intentional changes to parsing behavior occur.
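Updating a fixture amounts to replacing the golden file with freshly generated output once the change has been reviewed. The following is a minimal sketch of that step, not the repository's own update mechanism; the helper name `overwrite_golden_files` and both directory arguments are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def overwrite_golden_files(actual_dir: str, expected_dir: str) -> List[str]:
    """Copy every *.json under actual_dir to the same relative path in expected_dir.

    Hypothetical helper for illustration only. Returns the relative paths copied.
    """
    actual_root = Path(actual_dir)
    expected_root = Path(expected_dir)
    copied = []
    for src in sorted(actual_root.rglob("*.json")):
        rel = src.relative_to(actual_root)
        dst = expected_root / rel
        # Create connector subdirectories (e.g. jira-diff/1/) if they are missing
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        copied.append(rel.as_posix())
    return copied
```

Reviewing the resulting change in version control before committing keeps accidental regressions from being baked into the baseline.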
Code Reference
Source Location
- Repository: Unstructured_IO_Unstructured
- File: test_unstructured_ingest/expected-structured-output/ (multiple subdirectories)
Files Covered
| Connector | File | Lines |
|---|---|---|
| Confluence | confluence-diff/MFS/1540126.json | 341 |
| Confluence | confluence-diff/MFS/1605956.json | 924 |
| Confluence | confluence-diff/MFS/229477.json | 1058 |
| Confluence | confluence-diff/testteamsp/1605859.json | 1058 |
| Confluence | confluence-diff/testteamsp/1605989.json | 815 |
| Confluence | confluence-diff/testteamsp/1802252.json | 815 |
| Jira | jira-diff/1/10000.json | 464 |
| Jira | jira-diff/1/10001.json | 310 |
| Notion | notion/b2a12157-721e-4207-b3b7-527762b782c2.json | 356 |
| Notion | notion/c47a4566-4c7a-488b-ac2a-1292ee507fcb.json | 631 |
| Salesforce | salesforce/Campaign/701Hu000001eX9EIAU.xml.json | 702 |
| Salesforce | salesforce/Campaign/701Hu000001eX9FIAU.xml.json | 702 |
| Salesforce | salesforce/Campaign/701Hu000001eX9GIAU.xml.json | 702 |
| Salesforce | salesforce/Campaign/701Hu000001eX9HIAU.xml.json | 702 |
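The directory layout in the table maps directly onto connectors: the first path component names the connector, with a `-diff` suffix for Confluence and Jira. As a small sketch, the paths from the table can be grouped to confirm the per-connector file counts. The `connector_of` helper is hypothetical and assumes only the naming pattern visible in the table.

```python
from collections import Counter
from pathlib import PurePosixPath

# Relative paths copied from the "Files Covered" table
FIXTURES = [
    "confluence-diff/MFS/1540126.json",
    "confluence-diff/MFS/1605956.json",
    "confluence-diff/MFS/229477.json",
    "confluence-diff/testteamsp/1605859.json",
    "confluence-diff/testteamsp/1605989.json",
    "confluence-diff/testteamsp/1802252.json",
    "jira-diff/1/10000.json",
    "jira-diff/1/10001.json",
    "notion/b2a12157-721e-4207-b3b7-527762b782c2.json",
    "notion/c47a4566-4c7a-488b-ac2a-1292ee507fcb.json",
    "salesforce/Campaign/701Hu000001eX9EIAU.xml.json",
    "salesforce/Campaign/701Hu000001eX9FIAU.xml.json",
    "salesforce/Campaign/701Hu000001eX9GIAU.xml.json",
    "salesforce/Campaign/701Hu000001eX9HIAU.xml.json",
]


def connector_of(rel_path: str) -> str:
    """Return the top-level directory name with any '-diff' suffix removed."""
    top = PurePosixPath(rel_path).parts[0]
    suffix = "-diff"
    return top[: -len(suffix)] if top.endswith(suffix) else top


counts = Counter(connector_of(p) for p in FIXTURES)
```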
Signature
[
  {
    "type": "NarrativeText",
    "element_id": "hex_string",
    "text": "Content from collaboration platform",
    "metadata": {
      "languages": ["eng"],
      "filetype": "application/xml",
      "data_source": {
        "url": "platform_specific_url",
        "record_locator": {
          "protocol": "confluence|jira|notion|salesforce"
        },
        "date_created": "unix_timestamp",
        "date_modified": "unix_timestamp"
      }
    }
  }
]
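A quick structural check can catch a malformed fixture before it is used as a baseline. The sketch below assumes that every element carries the four top-level keys shown in the signature; that requirement is inferred from this page rather than taken from a formal schema, and the names `REQUIRED_KEYS` and `check_elements` are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Top-level keys shown in the signature above (assumed to be required)
REQUIRED_KEYS = ("type", "element_id", "text", "metadata")


def check_elements(elements: List[Dict[str, Any]]) -> List[Tuple[int, List[str]]]:
    """Return (index, missing_keys) for each element that lacks a required key."""
    problems = []
    for i, element in enumerate(elements):
        missing = [key for key in REQUIRED_KEYS if key not in element]
        if missing:
            problems.append((i, missing))
    return problems


sample = [
    {
        "type": "NarrativeText",
        "element_id": "hex_string",
        "text": "Content from collaboration platform",
        "metadata": {"languages": ["eng"]},
    },
    {"type": "Title", "text": "Element without an id or metadata"},
]
problems = check_elements(sample)
```

Here `problems` flags only the second element, listing `element_id` and `metadata` as missing.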
Import
# Not an importable module: the fixtures are plain JSON files consumed by test scripts.
# They can be loaded directly for inspection:
import json
with open("test_unstructured_ingest/expected-structured-output/confluence-diff/MFS/229477.json") as f:
    expected_elements = json.load(f)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| JSON file path | str | Yes | Path to a golden file under expected-structured-output/ |
Outputs
| Name | Type | Description |
|---|---|---|
| elements | List[Dict] | JSON array of element dicts with type, element_id, text, and metadata |
| metadata.data_source | Dict | Platform-specific provenance (URL, record locator, timestamps) |
| metadata.filetype | str | MIME type of the original document |
| metadata.languages | List[str] | Detected languages as three-letter ISO 639 codes (e.g., eng) |
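The contract above can be sketched as a small loader plus an accessor for the documented metadata fields. Both function names are hypothetical, and the accessor tolerates missing fields on the assumption that not every connector populates all of them.

```python
import json
from typing import Any, Dict, List


def load_golden(path: str) -> List[Dict[str, Any]]:
    """Load a golden file and confirm the top-level value is a JSON array."""
    with open(path, encoding="utf-8") as f:
        elements = json.load(f)
    if not isinstance(elements, list):
        raise ValueError(f"{path}: expected a JSON array of elements")
    return elements


def summarize(element: Dict[str, Any]) -> Dict[str, Any]:
    """Pull out the fields listed in the Outputs table, tolerating absent keys."""
    metadata = element.get("metadata") or {}
    data_source = metadata.get("data_source") or {}
    return {
        "type": element.get("type"),
        "filetype": metadata.get("filetype"),
        "languages": metadata.get("languages", []),
        "url": data_source.get("url"),
    }
```

Calling `summarize` on each element returned by `load_golden` gives a compact view of a fixture's provenance without printing the full text.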
Usage Examples
Comparing Actual Output Against Golden File
import json
# Load expected and actual output
with open("test_unstructured_ingest/expected-structured-output/confluence-diff/MFS/229477.json") as f:
    expected = json.load(f)
with open("/tmp/actual-output/229477.json") as f:
    actual = json.load(f)
# Compare element counts first: zip() below would silently truncate to the shorter list
assert len(actual) == len(expected), f"Element count mismatch: {len(actual)} actual vs {len(expected)} expected"
# Compare element types and text pairwise
for i, (exp, act) in enumerate(zip(expected, actual)):
    assert exp["type"] == act["type"], f"Element {i}: type mismatch"
    assert exp["text"] == act["text"], f"Element {i}: text mismatch"
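Assertions stop at the first mismatch and say little about what changed. As an alternative sketch, a unified diff of the serialized elements shows every differing field at once. This uses the standard-library `difflib` and is not the diff utility the ingest test scripts rely on; the `element_diff` name and the `label` parameter are hypothetical.

```python
import difflib
import json
from typing import Any, Dict, List


def element_diff(expected: Dict[str, Any], actual: Dict[str, Any], label: str = "element") -> List[str]:
    """Return unified-diff lines between two elements; an empty list means they match."""
    # sort_keys makes the comparison independent of key order in the JSON files
    exp_lines = json.dumps(expected, indent=2, sort_keys=True).splitlines()
    act_lines = json.dumps(actual, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(
            exp_lines,
            act_lines,
            fromfile=f"expected/{label}",
            tofile=f"actual/{label}",
            lineterm="",
        )
    )


diff = element_diff(
    {"type": "Title", "text": "a"},
    {"type": "Title", "text": "b"},
    label="0",
)
print("\n".join(diff))
```

Printing the diff for each mismatched pair makes it easier to judge whether a change is intentional before a fixture is updated.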