Principle:Unstructured IO Unstructured Golden File Regression Testing
| Knowledge Sources | |
|---|---|
| Domains | Testing, Quality_Assurance, Ingest |
| Last Updated | 2026-02-12 09:30 GMT |
Overview
Testing methodology that validates system output against pre-approved reference files (golden files) to detect unintended behavioral regressions.
Description
Golden file regression testing (also known as snapshot testing or approval testing) is a testing strategy where the expected output of a system is captured and stored as a reference file. On subsequent test runs, the actual output is compared against the golden file — any difference indicates either a bug (unexpected regression) or an intentional change that requires updating the golden file.
In Unstructured, this principle is applied extensively to the ingest pipeline. Each connector (S3, Azure, Confluence, Salesforce, etc.) has golden files that capture the exact JSON element output produced when processing specific test documents. The CI pipeline runs each connector, compares the output against the golden files, and fails if any difference is detected.
This approach is distinct from assertion-based testing (which checks specific properties) in that it captures the complete output, making it effective at detecting unexpected changes in element ordering, metadata fields, text extraction, or formatting.
Usage
Apply golden file regression testing when:
- The output is complex and structured (JSON, XML, HTML)
- Small changes in processing logic can produce subtle output differences
- You need to validate the complete output, not just specific properties
- Multiple connectors must produce consistent element formats
It is the right choice for the Unstructured ingest pipeline where each connector produces hundreds to thousands of elements and any change in element type, text, or metadata could break downstream consumers.
Theoretical Basis
The golden file testing cycle operates as follows:
Pseudo-code Logic:
# Abstract golden file testing algorithm (NOT real implementation)
def golden_file_test(connector, test_documents, golden_dir):
# Step 1: Run the pipeline
actual_output = connector.process(test_documents)
# Step 2: Serialize output
actual_json = serialize_elements(actual_output)
# Step 3: Compare against golden file
golden_path = golden_dir / connector.name / test_documents.name + ".json"
expected_json = read_file(golden_path)
# Step 4: Diff
if actual_json != expected_json:
if OVERWRITE_FIXTURES:
write_file(golden_path, actual_json) # Update golden file
else:
report_diff(expected_json, actual_json) # Fail test
raise RegressionDetected()
Key properties:
- Determinism: Golden files assume deterministic output for the same input
- Overwrite mode: `OVERWRITE_FIXTURES=true` allows intentional updates
- Diff-based reporting: Failures show the exact lines that changed
- Per-connector isolation: Each connector has its own golden files
Related Pages
- Implementation:Unstructured_IO_Unstructured_Golden_File_Fixtures_Cloud_Storage
- Implementation:Unstructured_IO_Unstructured_Golden_File_Fixtures_Collaboration
- Implementation:Unstructured_IO_Unstructured_Golden_File_Fixtures_Embedding
- Implementation:Unstructured_IO_Unstructured_Golden_File_Fixtures_Local
- Implementation:Unstructured_IO_Unstructured_Example_Docs_Fixture
- Heuristic:Unstructured_IO_Unstructured_Golden_File_Diff