Heuristic:Unstructured IO Unstructured Golden File Diff

Knowledge Sources	unstructured
Domains	Testing, Quality_Assurance, Ingest
Last Updated	2026-02-12 09:30 GMT

Overview

Golden file diffing compares current pipeline output against committed baseline fixtures to detect regressions in document processing behavior.

Description

The Unstructured ingest test suite uses a golden file (also called snapshot or expected output) strategy for regression testing. Each connector test produces structured JSON output, which is then compared byte-for-byte against committed baseline files in the expected-structured-output/ directory tree.

The comparison is performed by check-diff-expected-output.sh (test_unstructured_ingest/check-diff-expected-output.sh), which runs a recursive diff between the expected baseline directory and the actual output directory. If any differences are found, the script prints a diffstat summary followed by the full unified diff and exits with code 1. A companion script, check-num-files-output.sh, verifies that the number of output files matches an expected count.

When baselines need updating (e.g., after an intentional format change), setting OVERWRITE_FIXTURES=true causes the diff script to copy current output over the baselines instead of failing.
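
The compare-or-overwrite flow described above can be tried in isolation. The following is a minimal sketch, not the repository's script: the function name golden_check, the temporary directory layout, and the JSON contents are all hypothetical and chosen only for illustration.

```shell
# Hypothetical helper, not the repository's script: compare an output
# directory with a baseline directory, or refresh the baseline when
# OVERWRITE_FIXTURES=true.
golden_check() {
    local expected_dir=$1 output_dir=$2
    if [ "${OVERWRITE_FIXTURES:-false}" = "true" ]; then
        mkdir -p "$expected_dir"
        cp -rf "$output_dir/." "$expected_dir/"
        return 0
    fi
    # diff exits 0 when the trees match and 1 when they differ.
    diff -ruN "$expected_dir" "$output_dir"
}

# Demo in a temporary directory with made-up file contents.
work=$(mktemp -d)
mkdir -p "$work/expected" "$work/actual"
echo '{"type": "Title", "text": "Hello"}' > "$work/expected/doc.json"
echo '{"type": "NarrativeText", "text": "Hello"}' > "$work/actual/doc.json"

# First run: the output differs from the baseline, so the check fails.
golden_check "$work/expected" "$work/actual" > "$work/report.txt"
before_status=$?

# Refresh the baseline, then compare again: the check now passes.
OVERWRITE_FIXTURES=true golden_check "$work/expected" "$work/actual"
golden_check "$work/expected" "$work/actual" > /dev/null
after_status=$?
```

In this sketch report.txt holds the unified diff from the failing run, which is the part a reviewer would inspect before deciding whether a baseline refresh is justified.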

Usage

Apply this heuristic whenever:

Adding a new ingest connector and creating its initial golden file baseline.
Modifying partitioning or element serialization logic that changes output format.
Debugging test failures caused by unexpected output changes in connector pipelines.
Updating baselines after an intentional format change using OVERWRITE_FIXTURES=true.

The Insight (Rule of Thumb)

Action: Compare structured JSON output against committed baselines using recursive diff. Fail the test if any byte-level differences are detected.
Value: Golden file diffs catch unintended regressions in element extraction, metadata population, and serialization format without requiring hand-written assertions for every field.
Trade-off: Golden files are brittle under intentional changes, since every format update requires a baseline refresh. Large fixture directories increase repository size. Non-deterministic metadata fields (timestamps, UUIDs) must be excluded or normalized before comparison, otherwise every run reports a difference.
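
One possible way to normalize non-deterministic fields is to mask them in a copy of each file before diffing. The sketch below assumes UUIDs and ISO 8601 timestamps appear as plain strings in the JSON. The function name normalize_json, the placeholder tokens, and the sample field names are hypothetical; the source does not state that the Unstructured scripts normalize output this way.

```shell
# Hypothetical filter: mask volatile values so that two runs compare equal.
normalize_json() {
    sed -E \
        -e 's/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/<UUID>/g' \
        -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?(Z|[+-][0-9]{2}:?[0-9]{2})?/<TIMESTAMP>/g'
}

# Two made-up outputs that differ only in an ID and a timestamp.
run_a='{"id": "3f2b8c1e-9a4d-4e7b-8c21-0a1b2c3d4e5f", "date_processed": "2024-01-15T10:30:00Z", "text": "Hello"}'
run_b='{"id": "b7e1d0aa-1234-4cde-9f00-aabbccddeeff", "date_processed": "2024-02-20T08:05:42+00:00", "text": "Hello"}'

norm_a=$(printf '%s\n' "$run_a" | normalize_json)
norm_b=$(printf '%s\n' "$run_b" | normalize_json)

if [ "$norm_a" = "$norm_b" ]; then
    echo "outputs match after normalization"
fi
```

A pattern-based mask like this also hides genuine content that happens to look like a UUID or timestamp, so excluding specific fields by name is the stricter alternative when the schema is known.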

Reasoning

Document processing pipelines produce complex structured output with many fields. Writing individual assertions for every element, metadata field, and edge case is impractical. By committing full expected output and using diff-based comparison, the test suite achieves comprehensive regression coverage with minimal assertion code. The OVERWRITE_FIXTURES escape hatch provides a controlled mechanism for updating baselines after intentional changes, while the diff report makes it easy to review exactly what changed.

Code Evidence

Recursive diff comparison (check-diff-expected-output.sh):

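# check-diff-expected-output.sh - compare actual vs expected output
# SCRIPT_DIR is not defined in this excerpt; it is assumed here to be the
# directory that contains the script.
SCRIPT_DIR=$(dirname "$(realpath "$0")")
OUTPUT_DIR="structured-output/$1"
EXPECTED_DIR="expected-structured-output/$1"

# Refresh the baseline instead of comparing when OVERWRITE_FIXTURES=true.
if [ "$OVERWRITE_FIXTURES" == "true" ]; then
    cp -rf "$SCRIPT_DIR/$OUTPUT_DIR/." "$SCRIPT_DIR/$EXPECTED_DIR/"
    exit 0
fi

# On any difference, print a diffstat summary and the full unified diff, then fail.
DIFF=$(diff -ruN "$SCRIPT_DIR/$EXPECTED_DIR" "$SCRIPT_DIR/$OUTPUT_DIR")
if [ -n "$DIFF" ]; then
    echo "$DIFF" | diffstat
    echo "$DIFF"
    exit 1
fi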

File count validation (check-num-files-output.sh):

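# check-num-files-output.sh - verify expected number of output files
# SCRIPT_DIR is not defined in this excerpt; it is assumed here to be the
# directory that contains the script.
SCRIPT_DIR=$(dirname "$(realpath "$0")")
EXPECTED_COUNT=$1
OUTPUT_DIR="structured-output/$2"

# Count regular files recursively and fail on a mismatch.
ACTUAL_COUNT=$(find "$SCRIPT_DIR/$OUTPUT_DIR" -type f | wc -l)
if [ "$ACTUAL_COUNT" -ne "$EXPECTED_COUNT" ]; then
    echo "Expected $EXPECTED_COUNT files, found $ACTUAL_COUNT"
    exit 1
fi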
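
The counting step can be checked on a throwaway tree. This is a small sketch under stated assumptions: the helper name count_files and the directory and file names are hypothetical, and the whitespace stripping is an addition for wc implementations that pad their output.

```shell
# Hypothetical helper: count regular files under a directory, recursively.
count_files() {
    find "$1" -type f | wc -l | tr -d '[:space:]'
}

# Demo tree with made-up names: two files at the top level, one nested.
demo=$(mktemp -d)
mkdir -p "$demo/structured-output/example/nested"
touch "$demo/structured-output/example/a.json" \
      "$demo/structured-output/example/b.json" \
      "$demo/structured-output/example/nested/c.json"

expected_count=3
actual_count=$(count_files "$demo/structured-output/example")
if [ "$actual_count" -ne "$expected_count" ]; then
    echo "Expected $expected_count files, found $actual_count"
fi
```

Because the count is taken from lines of find output, a file name containing a newline would inflate it; that is unlikely for generated JSON output but worth knowing when reusing the pattern elsewhere.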
