Principle:Unstructured IO Unstructured Pipeline Output Validation
| Knowledge Sources | |
|---|---|
| Domains | Testing, Quality_Assurance, CI_CD |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A validation mechanism that verifies pipeline output correctness by comparing actual structured output against expected baseline fixtures using diff-based comparison.
Description
Pipeline output validation ensures that the ingest pipeline produces consistent, correct results across code changes. The approach uses golden file testing: known-good output files serve as baselines, and test runs compare actual output against these baselines using unified diff. Any discrepancy indicates a regression or intentional change that needs review.
Two complementary validation methods are used:
- Content diff: Compare actual JSON output files against expected baselines, producing unified diff reports when mismatches are found
- File count: Verify that the expected number of output files was produced, catching cases where files are missing or duplicated
Usage
Use this principle when establishing regression testing for document processing pipelines. It is the primary quality gate in the Unstructured CI/CD pipeline, running after every connector integration test to verify output stability. The OVERWRITE_FIXTURES mechanism allows intentional baseline updates when output format changes are expected.
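The baseline update path can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the helper names `overwrite_requested` and `check_or_update` are hypothetical, and the assumption that overwrite mode replaces the entire expected directory with the actual output is ours.

```python
import os
import shutil
from pathlib import Path


def overwrite_requested(env=None) -> bool:
    """Return True when OVERWRITE_FIXTURES is set to "true" (case-insensitive)."""
    env = os.environ if env is None else env
    return env.get("OVERWRITE_FIXTURES", "false").strip().lower() == "true"


def check_or_update(actual_dir: Path, expected_dir: Path, env=None) -> list[str]:
    """Refresh the baseline in overwrite mode; otherwise list mismatching files."""
    if overwrite_requested(env):
        # Intentional change (assumed behavior): replace the whole baseline.
        if expected_dir.exists():
            shutil.rmtree(expected_dir)
        shutil.copytree(actual_dir, expected_dir)
        return []
    mismatches = []
    for expected_file in sorted(expected_dir.rglob("*.json")):
        rel = expected_file.relative_to(expected_dir)
        actual_file = actual_dir / rel
        # A missing file and a byte-level difference both count as a mismatch.
        if not actual_file.is_file():
            mismatches.append(rel.as_posix())
        elif actual_file.read_bytes() != expected_file.read_bytes():
            mismatches.append(rel.as_posix())
    return mismatches
```

Passing `env` explicitly keeps the helper testable without touching the real process environment. In normal use the argument would be omitted so that `os.environ` is consulted.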
Theoretical Basis
Golden file testing is a deterministic validation strategy:
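# Abstract validation algorithm
import difflib
from pathlib import Path

def validate_output(actual_dir: Path, expected_dir: Path) -> list[str]:
    failures = []  # one entry per missing or mismatching file
    for expected_file in sorted(expected_dir.rglob("*.json")):
        actual_file = actual_dir / expected_file.relative_to(expected_dir)
        if not actual_file.is_file():
            failures.append(f"Missing output: {actual_file}")
            continue
        expected = expected_file.read_text().splitlines(keepends=True)
        actual = actual_file.read_text().splitlines(keepends=True)
        diff = "".join(difflib.unified_diff(expected, actual, "expected", "actual"))
        if diff:
            failures.append(diff)
    return failures

def validate_count(actual_dir: Path, expected_count: int) -> None:
    actual_count = sum(1 for p in actual_dir.rglob("*") if p.is_file())
    if actual_count != expected_count:
        raise AssertionError(f"Expected {expected_count}, got {actual_count}")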
Key features:
- Deterministic output: Uses OMP_THREAD_LIMIT=1 to ensure reproducible results
- Baseline update: OVERWRITE_FIXTURES=true overwrites expected output for intentional changes
- Diff reporting: Uses diffstat for summarized change reports
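To illustrate the kind of summary that diff reporting provides, the sketch below counts inserted and deleted lines per file using Python's `difflib`. It only approximates what the `diffstat` tool reports and does not invoke it. The function name `diff_summary` and the output format are hypothetical.

```python
import difflib


def diff_summary(name: str, expected: str, actual: str) -> str:
    """Summarize a unified diff as 'name | +inserted -deleted'."""
    diff = list(
        difflib.unified_diff(
            expected.splitlines(), actual.splitlines(), lineterm="", n=0
        )
    )
    inserted = deleted = 0
    # The first two lines, when present, are the '---' and '+++' file headers.
    for line in diff[2:]:
        if line.startswith("@@"):
            continue  # hunk marker, not a content line
        if line.startswith("+"):
            inserted += 1
        elif line.startswith("-"):
            deleted += 1
    return f"{name} | +{inserted} -{deleted}"
```

A summary like this gives reviewers a quick view of how much of each fixture changed before they read the full unified diff.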