Heuristic:Unstructured IO Unstructured Golden File Diff
| Knowledge Sources | |
|---|---|
| Domains | Testing, Quality_Assurance, Ingest |
| Last Updated | 2026-02-12 09:30 GMT |
Overview
Golden file diffing compares current pipeline output against committed baseline fixtures to detect regressions in document processing behavior.
Description
The Unstructured ingest test suite uses a golden file (also called snapshot or expected output) strategy for regression testing. Each connector test produces structured JSON output, which is then compared byte-for-byte against committed baseline files in the expected-structured-output/ directory tree.
The comparison is performed by check-diff-expected-output.sh (test_unstructured_ingest/check-diff-expected-output.sh), which runs diff recursively between the actual output directory and the expected baseline directory. If any differences are found, the script prints a per-file summary produced by diffstat, followed by the full unified diff, and exits with code 1. A companion script, check-num-files-output.sh, verifies that the number of output files matches an expected count.
When baselines need updating (e.g., after an intentional format change), setting OVERWRITE_FIXTURES=true causes the diff script to copy current output over the baselines instead of failing.
Usage
Apply this heuristic whenever:
- Adding a new ingest connector and creating its initial golden file baseline.
- Modifying partitioning or element serialization logic that changes output format.
- Debugging test failures caused by unexpected output changes in connector pipelines.
- Updating baselines after an intentional format change using OVERWRITE_FIXTURES=true.
The Insight (Rule of Thumb)
- Action: Compare structured JSON output against committed baselines using recursive diff. Fail the test if any byte-level differences are detected.
- Value: Golden file diffs catch unintended regressions in element extraction, metadata population, and serialization format without requiring hand-written assertions for every field.
- Trade-off: Golden files are brittle to intentional changes: every format update requires a baseline refresh. Large fixture directories increase repository size. Because the comparison is byte-level, non-deterministic metadata fields (timestamps, UUIDs) must be excluded or normalized before comparison, or they will fail every run.
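The normalization mentioned in the trade-off can be sketched as a filter applied to the JSON before it is diffed. This is a minimal illustration, not code from the Unstructured test suite: the function name normalize_volatile_fields and the placeholder strings are hypothetical, and it assumes that volatile values appear as UUID-shaped strings or ISO 8601 timestamps.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mask values that change on every run so that a
# byte-level diff only reports meaningful differences.
normalize_volatile_fields() {
  sed -E \
    -e 's/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/<UUID>/g' \
    -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?(Z|[+-][0-9]{2}:[0-9]{2})?/<TIMESTAMP>/g'
}

# Two runs that differ only in their volatile fields.
run_a='{"id": "3f2b8c1e-9a4d-4e6f-8b2a-1c3d5e7f9a0b", "created": "2024-01-01T10:00:00Z", "text": "Hello"}'
run_b='{"id": "7a1c9e2d-0b3f-4c5a-9d8e-2f4a6b8c0d1e", "created": "2024-06-15T23:59:59+02:00", "text": "Hello"}'

norm_a=$(printf '%s\n' "$run_a" | normalize_volatile_fields)
norm_b=$(printf '%s\n' "$run_b" | normalize_volatile_fields)

# Both runs normalize to the same line, so a diff between them would be empty.
printf '%s\n' "$norm_a"
```

Applying the same filter to both the baseline and the fresh output keeps the comparison strict for every other field. A pattern-based filter like this can also mask a UUID-shaped value that is genuinely part of the document text, so excluding specific keys may be the safer choice when the schema is known.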
Reasoning
Document processing pipelines produce complex structured output with many fields. Writing individual assertions for every element, metadata field, and edge case is impractical. By committing full expected output and using diff-based comparison, the test suite achieves comprehensive regression coverage with minimal assertion code. The OVERWRITE_FIXTURES escape hatch provides a controlled mechanism for updating baselines after intentional changes, while the diff report makes it easy to review exactly what changed.
Code Evidence
Recursive diff comparison (check-diff-expected-output.sh):
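# check-diff-expected-output.sh - compare actual vs expected output
# Assumption: SCRIPT_DIR is the directory containing this script; the excerpt
# uses the variable without showing where it is set.
SCRIPT_DIR=$(dirname "$(realpath "$0")")
OUTPUT_DIR="structured-output/$1"
EXPECTED_DIR="expected-structured-output/$1"
# Refresh mode: copy the current output over the baselines and skip the comparison.
if [ "$OVERWRITE_FIXTURES" = "true" ]; then
  cp -rf "$SCRIPT_DIR/$OUTPUT_DIR/." "$SCRIPT_DIR/$EXPECTED_DIR/"
  exit 0
fi
# -r recurses into subdirectories, -u selects unified format,
# -N treats a file missing on one side as empty so it still appears in the diff.
DIFF=$(diff -ruN "$SCRIPT_DIR/$EXPECTED_DIR" "$SCRIPT_DIR/$OUTPUT_DIR")
if [ -n "$DIFF" ]; then
  echo "$DIFF" | diffstat   # per-file summary of changed lines
  echo "$DIFF"              # full unified diff
  exit 1
fi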
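To see how the comparison above behaves, the following sketch replays it on throwaway directories. It is an illustration under stated assumptions rather than part of the test suite: the compare_or_overwrite function and the file names are hypothetical, and the function uses diff's exit status (0 when nothing differs) and returns instead of exiting so it can be called several times.

```shell
#!/usr/bin/env bash
# Demo on throwaway directories; all names below are illustrative.
work=$(mktemp -d)
mkdir -p "$work/expected" "$work/actual"
printf '{"text": "Hello"}\n'  > "$work/expected/a.json"
printf '{"text": "Hello!"}\n' > "$work/actual/a.json"  # content changed
printf '{"text": "New"}\n'    > "$work/actual/b.json"  # no baseline yet

# Hypothetical wrapper around the same diff/overwrite logic.
compare_or_overwrite() {
  local expected="$1" actual="$2" report
  if [ "${OVERWRITE_FIXTURES:-false}" = "true" ]; then
    cp -rf "$actual/." "$expected/"
    return 0
  fi
  # diff exits 0 only when the two trees are identical.
  if report=$(diff -ruN "$expected" "$actual"); then
    return 0
  fi
  echo "$report"
  return 1
}

# 1. Output drifted from the baseline: the check fails and reports both files.
first_status=0
first_report=$(compare_or_overwrite "$work/expected" "$work/actual") || first_status=$?

# 2. Accept the change by refreshing the baseline.
OVERWRITE_FIXTURES=true
compare_or_overwrite "$work/expected" "$work/actual"
OVERWRITE_FIXTURES=false

# 3. The same comparison now passes.
final_status=0
compare_or_overwrite "$work/expected" "$work/actual" || final_status=$?

echo "first run: $first_status, after refresh: $final_status"
# prints: first run: 1, after refresh: 0
rm -rf "$work"
```

Because of -N, the first report includes b.json even though it has no baseline, which is how a newly produced output file surfaces as a failure rather than being silently ignored.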
File count validation (check-num-files-output.sh):
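# check-num-files-output.sh - verify expected number of output files
# Assumption: SCRIPT_DIR is the directory containing this script; the excerpt
# uses the variable without showing where it is set.
SCRIPT_DIR=$(dirname "$(realpath "$0")")
EXPECTED_COUNT=$1
OUTPUT_DIR="structured-output/$2"
# Count regular files at any depth under the output directory.
ACTUAL_COUNT=$(find "$SCRIPT_DIR/$OUTPUT_DIR" -type f | wc -l)
if [ "$ACTUAL_COUNT" -ne "$EXPECTED_COUNT" ]; then
  echo "Expected $EXPECTED_COUNT files, found $ACTUAL_COUNT"
  exit 1
fi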
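The counting rule above is recursive and considers regular files only, so outputs in nested directories are counted and the directories themselves are not. A small sketch on a throwaway directory makes this concrete; the check_num_files function and the paths are hypothetical, and the tr call strips the padding that some wc implementations add.

```shell
#!/usr/bin/env bash
# Demo on a throwaway directory; names are illustrative.
work=$(mktemp -d)
mkdir -p "$work/structured-output/demo/nested"
touch "$work/structured-output/demo/a.json" \
      "$work/structured-output/demo/nested/b.json"

# Hypothetical function form of the count check: succeeds only when the
# number of regular files under the directory equals the expected count.
check_num_files() {
  local expected_count="$1" dir="$2" actual_count
  actual_count=$(find "$dir" -type f | wc -l | tr -d '[:space:]')
  if [ "$actual_count" -ne "$expected_count" ]; then
    echo "Expected $expected_count files, found $actual_count"
    return 1
  fi
  return 0
}

# Two files exist (one nested), so an expected count of 2 passes.
match_status=0
check_num_files 2 "$work/structured-output/demo" || match_status=$?

# An expected count of 3 fails and explains the mismatch.
mismatch_status=0
mismatch_msg=$(check_num_files 3 "$work/structured-output/demo") || mismatch_status=$?

echo "$mismatch_msg"
rm -rf "$work"
```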
Related Pages
- Principle:Unstructured_IO_Unstructured_Golden_File_Regression_Testing
- Implementation:Unstructured_IO_Unstructured_Golden_File_Fixtures_Cloud_Storage
- Implementation:Unstructured_IO_Unstructured_Golden_File_Fixtures_Collaboration
- Implementation:Unstructured_IO_Unstructured_Golden_File_Fixtures_Embedding
- Implementation:Unstructured_IO_Unstructured_Golden_File_Fixtures_Local
- Implementation:Unstructured_IO_Unstructured_Example_Docs_Fixture