Principle:DataExpert io Data engineer handbook SQL Verification Queries
Overview
SQL Verification Queries are targeted SELECT statements designed to validate that a database has been correctly seeded and that subsequent data transformations produce expected results. In the Dimensional Data Modeling workflow, verification queries serve as the primary mechanism for confirming data integrity after database initialization, table creation, and cumulative or historical transformations.
Theoretical Foundation
Purpose of Verification Queries
Verification queries answer a fundamental question: Does the data in the database match our expectations? They bridge the gap between the mechanical act of loading data and the intellectual confidence that the data is correct, complete, and ready for analytical use.
Verification queries operate at multiple levels:
- Schema verification -- Confirming that expected tables, columns, and constraints exist.
- Row count verification -- Ensuring the number of records matches expected totals.
- Data quality verification -- Checking for nulls, duplicates, orphaned records, and value range violations.
- Transformation verification -- Validating that derived tables (cumulative tables, SCD tables) contain correctly computed results.
The Verification Mindset
Effective verification requires thinking adversarially about data. Rather than assuming correctness, the engineer should:
- Enumerate assumptions -- What must be true about the data for downstream queries to work?
- Design falsifiable checks -- Write queries that would return unexpected results if an assumption is violated.
- Automate where possible -- Embed verification queries in CI/CD pipelines or post-seeding scripts.
Verification in Dimensional Modeling
Dimensional data modeling introduces specific verification challenges that go beyond simple row counts.
The Actors Table
When building a cumulative actors table from the base actor_films dataset, verification queries should confirm:
- Every actor from the source data appears in the target table.
- The films array or nested structure contains the correct set of films for each actor.
- The quality_class classification (e.g., star, good, average, bad) is derived correctly from the actor's average rating.
- The is_active flag accurately reflects whether the actor had films in the current year.
-- Example: Verify actor count matches source
SELECT COUNT(DISTINCT actor) AS source_actors FROM actor_films;
SELECT COUNT(*) AS target_actors FROM actors WHERE current_year = 2021;
SCD Type 2 History
Slowly Changing Dimension (SCD) Type 2 tables track historical changes by maintaining multiple rows per entity, each with a validity period. Verification queries for SCD Type 2 tables should check:
- No overlapping periods -- For a given actor, no two records should have overlapping
start_dateandend_dateranges. - Contiguous history -- The
end_dateof one record should match thestart_dateof the next record for the same entity. - Current record identification -- Exactly one record per entity should have an
end_dateof NULL or a sentinel value (e.g.,9999-12-31) representing the current state. - Change detection accuracy -- Transitions in
quality_classoris_activeshould be accurately reflected as new SCD rows.
-- Example: Verify no overlapping SCD periods for any actor
SELECT actorid, COUNT(*)
FROM actors_history_scd a1
JOIN actors_history_scd a2
ON a1.actorid = a2.actorid
AND a1.start_date < a2.end_date
AND a1.end_date > a2.start_date
AND a1.start_date <> a2.start_date
GROUP BY actorid
HAVING COUNT(*) > 0;
Cumulative Table Generation
Cumulative tables aggregate data across time periods. Verification queries should confirm:
- Monotonic growth -- The cumulative film count for an actor should never decrease from one year to the next.
- Correct aggregation -- The films array for year Y should contain all films from years up to and including Y.
- Year coverage -- Every year in the source data range should be represented in the cumulative table.
-- Example: Verify cumulative film count is non-decreasing
SELECT actor, current_year, array_length(films, 1) AS film_count
FROM actors
ORDER BY actor, current_year;
When to Apply Verification Queries
Verification queries should be executed at the following points in the workflow:
| Workflow Stage | Verification Focus |
|---|---|
| After database seeding | Confirm base tables exist and contain expected row counts. Validate column types and constraints. |
| After table creation | Verify new tables (e.g., actors, actors_history_scd) have the correct schema and initial data. |
| After cumulative generation | Check that cumulative aggregations are correct and complete across all time periods. |
| After SCD backfill | Validate historical SCD records for completeness, contiguity, and non-overlap. |
| Before analytical queries | Run a final integrity check to ensure the data supports the intended analysis. |
Best Practices
- Start with simple counts -- A mismatch in row counts is often the fastest indicator of a seeding or transformation problem.
- Use EXCEPT and INTERSECT -- Set operations are powerful tools for comparing expected vs. actual results.
- Document expected values -- Record the expected output of key verification queries so that future runs can be compared.
- Fail loudly -- In automated pipelines, verification queries should raise errors (not just warnings) when checks fail.
- Version your verification queries -- As the schema evolves, verification queries must evolve in tandem.
Related Pages
- Implementation:DataExpert_io_Data_engineer_handbook_SQL_Select_Query_Pattern
- Implementation:DataExpert_io_Data_engineer_handbook_SQL_Select_Query_Pattern -- Concrete SQL query patterns implementing these verification principles.
- Principle:DataExpert_io_Data_engineer_handbook_Database_Seeding -- The seeding process that verification queries validate.
- Implementation:DataExpert_io_Data_engineer_handbook_Pg_restore_Init_Script -- The initialization script whose output is verified by these queries.
Metadata
- Knowledge Sources: Data Engineer Handbook
- Domains: Data_Engineering, SQL, Infrastructure
- Last Updated: 2026-02-09 06:00 GMT