Principle: Unstructured IO / Golden File Regression Testing

From Leeroopedia
Knowledge Sources
Domains: Testing, Quality_Assurance, Ingest
Last Updated: 2026-02-12 09:30 GMT

Overview

A testing methodology that validates system output against pre-approved reference files (golden files) to detect unintended behavioral regressions.

Description

Golden file regression testing (also known as snapshot testing or approval testing) is a testing strategy where the expected output of a system is captured and stored as a reference file. On subsequent test runs, the actual output is compared against the golden file — any difference indicates either a bug (unexpected regression) or an intentional change that requires updating the golden file.

In Unstructured, this principle is applied extensively to the ingest pipeline. Each connector (S3, Azure, Confluence, Salesforce, etc.) has golden files that capture the exact JSON element output produced when processing specific test documents. The CI pipeline runs each connector, compares the output against the golden files, and fails if any difference is detected.

This approach is distinct from assertion-based testing (which checks specific properties) in that it captures the complete output, making it effective at detecting unexpected changes in element ordering, metadata fields, text extraction, or formatting.

Usage

Apply golden file regression testing when:

  • The output is complex and structured (JSON, XML, HTML)
  • Small changes in processing logic can produce subtle output differences
  • You need to validate the complete output, not just specific properties
  • Multiple connectors must produce consistent element formats

It is the right choice for the Unstructured ingest pipeline, where each connector produces hundreds to thousands of elements and any change in element type, text, or metadata could break downstream consumers.

Theoretical Basis

The golden file testing cycle operates as follows:

Pseudo-code Logic:

# Abstract golden file testing algorithm (NOT real implementation)
def golden_file_test(connector, test_documents, golden_dir):
    # Step 1: Run the pipeline
    actual_output = connector.process(test_documents)

    # Step 2: Serialize output
    actual_json = serialize_elements(actual_output)

    # Step 3: Compare against golden file
    golden_path = golden_dir / connector.name / (test_documents.name + ".json")
    expected_json = read_file(golden_path)

    # Step 4: Diff
    if actual_json != expected_json:
        if OVERWRITE_FIXTURES:
            write_file(golden_path, actual_json)  # Update golden file
        else:
            report_diff(expected_json, actual_json)  # Fail test
            raise RegressionDetected()
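The diff step (Step 4) can be realized with the standard library's difflib. The `report_diff` helper below is a hypothetical sketch matching the pseudo-code above, not the project's actual reporter:

```python
import difflib

def report_diff(expected_json: str, actual_json: str) -> str:
    """Return a unified diff showing exactly which lines changed."""
    return "".join(difflib.unified_diff(
        expected_json.splitlines(keepends=True),
        actual_json.splitlines(keepends=True),
        fromfile="golden",
        tofile="actual",
    ))
```

A unified diff gives the "failures show the exact lines that changed" property described below: unchanged lines appear as context, removed golden lines are prefixed with `-`, and new actual lines with `+`.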

Key properties:

  • Determinism: Golden files assume deterministic output for the same input
  • Overwrite mode: `OVERWRITE_FIXTURES=true` allows intentional updates
  • Diff-based reporting: Failures show the exact lines that changed
  • Per-connector isolation: Each connector has its own golden files
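Determinism rarely holds for raw output: fields such as processing timestamps legitimately vary between runs. A common remedy, sketched here under assumed field names (`last_modified`, `date_processed` are illustrative, not necessarily Unstructured's schema), is to normalize volatile metadata and serialize with sorted keys before comparing:

```python
import json

VOLATILE_FIELDS = {"last_modified", "date_processed"}  # assumed field names

def normalize_element(element: dict) -> dict:
    """Drop metadata fields that legitimately vary between runs."""
    metadata = {
        k: v for k, v in element.get("metadata", {}).items()
        if k not in VOLATILE_FIELDS
    }
    return {**element, "metadata": metadata}

def canonical_json(elements: list) -> str:
    """Serialize with sorted keys so dict ordering cannot cause spurious diffs."""
    return json.dumps(
        [normalize_element(e) for e in elements],
        indent=2,
        sort_keys=True,
    )
```

Comparing `canonical_json` output instead of raw serialization keeps golden files stable across runs while still catching every meaningful change.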
