Implementation:Unstructured IO Unstructured Golden File Fixtures Collaboration

Knowledge Sources	Unstructured_IO_Unstructured
Domains	Testing, Collaboration, Ingest
Last Updated	2026-02-12 09:30 GMT

Overview

Golden-file test fixtures containing the expected JSON element output for collaboration and CRM platform ingest connectors (Confluence, Jira, Notion, Salesforce).

Description

These JSON files hold the expected structured output from processing documents through the collaboration and CRM platform ingest connectors. Each file is a JSON array of element objects conforming to the Unstructured element schema. They serve as regression baselines: the CI pipeline diffs actual output against these golden files.
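
The shape described above (a JSON array of element objects) can be checked before a file is used as a baseline. The sketch below is illustrative only: the function name `check_golden_elements` and the `REQUIRED_KEYS` set are assumptions taken from the fields listed on this page, not part of the repository.

```python
import json

# Assumed minimal key set, taken from the fields this page lists for each element.
REQUIRED_KEYS = {"type", "element_id", "text", "metadata"}


def check_golden_elements(elements):
    """Return a list of problems; an empty list means the shape looks right."""
    if not isinstance(elements, list):
        return ["top-level value is not a JSON array"]
    problems = []
    for i, element in enumerate(elements):
        if not isinstance(element, dict):
            problems.append(f"element {i}: not an object")
            continue
        missing = REQUIRED_KEYS - set(element)
        if missing:
            problems.append(f"element {i}: missing keys {sorted(missing)}")
    return problems


# Hypothetical two-element sample; the second element is deliberately incomplete.
sample = json.loads(
    '[{"type": "Title", "element_id": "abc", "text": "Hi", "metadata": {}},'
    ' {"type": "Title"}]'
)
print(check_golden_elements(sample))
```

A check like this only confirms the outer shape; it does not validate the contents of `metadata`.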

The collaboration connectors covered include:

Confluence — Wiki pages from multiple spaces (6 files across MFS and testteamsp spaces)
Jira — Issue content from project boards (2 files)
Notion — Database and page content (2 files)
Salesforce — Campaign XML records (4 files)

Usage

These fixtures are consumed by the ingest test scripts (e.g., `test_unstructured_ingest/src/confluence-diff.sh`, `test_unstructured_ingest/src/salesforce.sh`) and the diff-checking utilities. Update them only when a change to parsing behavior is intentional, so the baselines keep catching unintended regressions.
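
The scripts named above are shell scripts, and this page does not show how they perform the comparison. As a rough Python illustration of the same idea, the sketch below compares a directory of actual output against a golden directory. The function name `compare_fixture_trees` and both directory arguments are hypothetical.

```python
import json
from pathlib import Path


def compare_fixture_trees(expected_dir, actual_dir):
    """Compare two directory trees of JSON files by parsed content."""
    expected_dir, actual_dir = Path(expected_dir), Path(actual_dir)
    expected = {p.relative_to(expected_dir).as_posix() for p in expected_dir.rglob("*.json")}
    actual = {p.relative_to(actual_dir).as_posix() for p in actual_dir.rglob("*.json")}

    changed = []
    for rel in sorted(expected & actual):
        # Compare parsed JSON so whitespace and key-order differences are ignored.
        exp_data = json.loads((expected_dir / rel).read_text(encoding="utf-8"))
        act_data = json.loads((actual_dir / rel).read_text(encoding="utf-8"))
        if exp_data != act_data:
            changed.append(rel)

    return {
        "missing": sorted(expected - actual),     # golden file exists, no actual output
        "unexpected": sorted(actual - expected),  # actual output exists, no golden file
        "changed": changed,                       # both exist but the content differs
    }
```

Comparing parsed JSON is a design choice for this sketch; a textual diff is stricter and would also flag formatting changes.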

Code Reference

Source Location

Repository: Unstructured_IO_Unstructured
Directory: test_unstructured_ingest/expected-structured-output/ (multiple subdirectories)

Files Covered

Connector	File	Lines
Confluence	confluence-diff/MFS/1540126.json	341
Confluence	confluence-diff/MFS/1605956.json	924
Confluence	confluence-diff/MFS/229477.json	1058
Confluence	confluence-diff/testteamsp/1605859.json	1058
Confluence	confluence-diff/testteamsp/1605989.json	815
Confluence	confluence-diff/testteamsp/1802252.json	815
Jira	jira-diff/1/10000.json	464
Jira	jira-diff/1/10001.json	310
Notion	notion/b2a12157-721e-4207-b3b7-527762b782c2.json	356
Notion	notion/c47a4566-4c7a-488b-ac2a-1292ee507fcb.json	631
Salesforce	salesforce/Campaign/701Hu000001eX9EIAU.xml.json	702
Salesforce	salesforce/Campaign/701Hu000001eX9FIAU.xml.json	702
Salesforce	salesforce/Campaign/701Hu000001eX9GIAU.xml.json	702
Salesforce	salesforce/Campaign/701Hu000001eX9HIAU.xml.json	702
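
The table can be cross-checked against the per-connector counts given in the Description. The sketch below groups the listed paths by their top-level directory. The directory-to-connector mapping is read off the table, and the helper name `count_by_connector` is hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath

# Relative paths copied from the table above.
FIXTURES = [
    "confluence-diff/MFS/1540126.json",
    "confluence-diff/MFS/1605956.json",
    "confluence-diff/MFS/229477.json",
    "confluence-diff/testteamsp/1605859.json",
    "confluence-diff/testteamsp/1605989.json",
    "confluence-diff/testteamsp/1802252.json",
    "jira-diff/1/10000.json",
    "jira-diff/1/10001.json",
    "notion/b2a12157-721e-4207-b3b7-527762b782c2.json",
    "notion/c47a4566-4c7a-488b-ac2a-1292ee507fcb.json",
    "salesforce/Campaign/701Hu000001eX9EIAU.xml.json",
    "salesforce/Campaign/701Hu000001eX9FIAU.xml.json",
    "salesforce/Campaign/701Hu000001eX9GIAU.xml.json",
    "salesforce/Campaign/701Hu000001eX9HIAU.xml.json",
]

# Mapping from top-level directory to connector, as the table pairs them.
CONNECTOR_BY_DIR = {
    "confluence-diff": "Confluence",
    "jira-diff": "Jira",
    "notion": "Notion",
    "salesforce": "Salesforce",
}


def count_by_connector(paths):
    """Count fixture files per connector using the first path component."""
    return Counter(CONNECTOR_BY_DIR[PurePosixPath(p).parts[0]] for p in paths)


counts = count_by_connector(FIXTURES)
print(dict(counts))
```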

Signature

[
  {
    "type": "NarrativeText",
    "element_id": "hex_string",
    "text": "Content from collaboration platform",
    "metadata": {
      "languages": ["eng"],
      "filetype": "application/xml",
      "data_source": {
        "url": "platform_specific_url",
        "record_locator": {
          "protocol": "confluence|jira|notion|salesforce"
        },
        "date_created": "unix_timestamp",
        "date_modified": "unix_timestamp"
      }
    }
  }
]
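
As an illustration of how the nested provenance fields in this sketch might be read, the helper below pulls the protocol and URL out of one element. It assumes the layout shown above. Real fixture files may carry more or fewer keys, so absent fields come back as `None`. The function name and sample values are hypothetical.

```python
def describe_source(element):
    """Return (protocol, url) for one element dict; absent fields become None."""
    data_source = element.get("metadata", {}).get("data_source", {})
    protocol = data_source.get("record_locator", {}).get("protocol")
    return protocol, data_source.get("url")


# Hypothetical element following the signature sketch; the values are placeholders.
element = {
    "type": "NarrativeText",
    "element_id": "hex_string",
    "text": "Content from collaboration platform",
    "metadata": {
        "data_source": {
            "url": "platform_specific_url",
            "record_locator": {"protocol": "confluence"},
        }
    },
}

print(describe_source(element))
```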

Import

# Not an importable module; the fixtures are plain JSON files consumed by test scripts
import json
with open("test_unstructured_ingest/expected-structured-output/confluence-diff/MFS/229477.json") as f:
    expected_elements = json.load(f)

I/O Contract

Inputs

Name	Type	Required	Description
JSON file path	str	Yes	Path to a golden file under expected-structured-output/

Outputs

Name	Type	Description
elements	List[Dict]	JSON array of element dicts with type, element_id, text, and metadata
metadata.data_source	Dict	Platform-specific provenance (URL, record locator, timestamps)
metadata.filetype	str	MIME type of the original document
metadata.languages	List[str]	Detected languages (ISO 639-3 codes, e.g. eng)
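
Once a golden file is loaded, the output fields above lend themselves to a quick summary, which can help when reviewing a fixture update. The sketch below is a minimal illustration. The `summarize` helper and the sample elements (including the `text/html` value) are made up for the example.

```python
from collections import Counter


def summarize(elements):
    """Tally element types, languages, and filetypes across one fixture."""
    types, languages, filetypes = Counter(), Counter(), Counter()
    for element in elements:
        types[element["type"]] += 1
        metadata = element.get("metadata", {})
        languages.update(metadata.get("languages", []))
        filetype = metadata.get("filetype")
        if filetype is not None:  # skip elements that record no filetype
            filetypes[filetype] += 1
    return {"types": types, "languages": languages, "filetypes": filetypes}


# Hypothetical elements; in practice these would come from json.load() on a golden file.
elements = [
    {"type": "Title", "element_id": "a", "text": "T",
     "metadata": {"languages": ["eng"], "filetype": "text/html"}},
    {"type": "NarrativeText", "element_id": "b", "text": "Body",
     "metadata": {"languages": ["eng"], "filetype": "text/html"}},
    {"type": "NarrativeText", "element_id": "c", "text": "More", "metadata": {}},
]

summary = summarize(elements)
print(summary["types"])
```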

Usage Examples

Comparing Actual Output Against Golden File

import json

# Load expected and actual output
with open("test_unstructured_ingest/expected-structured-output/confluence-diff/MFS/229477.json") as f:
    expected = json.load(f)

with open("/tmp/actual-output/229477.json") as f:
    actual = json.load(f)

# Compare element counts first, so zip() below cannot silently drop trailing elements
assert len(actual) == len(expected), f"Element count mismatch: {len(actual)} vs {len(expected)}"

# Compare the type and text of each element pair
for i, (exp, act) in enumerate(zip(expected, actual)):
    assert exp["type"] == act["type"], f"Element {i}: type mismatch"
    assert exp["text"] == act["text"], f"Element {i}: text mismatch"
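
The comparison above stops at the first failing assert. When reviewing a regression it can be more useful to see every mismatch at once. The variant below is a sketch of that idea, not the repository's diff utility. The function name `collect_mismatches` and its `fields` parameter are hypothetical.

```python
from itertools import zip_longest


def collect_mismatches(expected, actual, fields=("type", "text")):
    """Return a description of every difference instead of failing on the first."""
    mismatches = []
    for i, (exp, act) in enumerate(zip_longest(expected, actual)):
        if exp is None or act is None:
            # zip_longest pads the shorter list with None.
            side = "actual" if exp is None else "expected"
            mismatches.append(f"element {i}: only present in {side}")
            continue
        for field in fields:
            if exp.get(field) != act.get(field):
                mismatches.append(
                    f"element {i}: {field} differs "
                    f"({exp.get(field)!r} != {act.get(field)!r})"
                )
    return mismatches


# Hypothetical data standing in for the loaded golden and actual files.
expected = [{"type": "Title", "text": "A"}, {"type": "NarrativeText", "text": "B"}]
actual = [
    {"type": "Title", "text": "A"},
    {"type": "Title", "text": "B"},
    {"type": "Title", "text": "C"},
]

for line in collect_mismatches(expected, actual):
    print(line)
```

An empty result means the two lists agree on the chosen fields. Passing a longer `fields` tuple widens the check.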
