Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:DataExpert io Data engineer handbook SQL Verification Queries

From Leeroopedia


Overview

SQL Verification Queries are targeted SELECT statements designed to validate that a database has been correctly seeded and that subsequent data transformations produce expected results. In the Dimensional Data Modeling workflow, verification queries serve as the primary mechanism for confirming data integrity after database initialization, table creation, and cumulative or historical transformations.

Theoretical Foundation

Purpose of Verification Queries

Verification queries answer a fundamental question: Does the data in the database match our expectations? They bridge the gap between the mechanical act of loading data and the intellectual confidence that the data is correct, complete, and ready for analytical use.

Verification queries operate at multiple levels:

  • Schema verification -- Confirming that expected tables, columns, and constraints exist.
  • Row count verification -- Ensuring the number of records matches expected totals.
  • Data quality verification -- Checking for nulls, duplicates, orphaned records, and value range violations.
  • Transformation verification -- Validating that derived tables (cumulative tables, SCD tables) contain correctly computed results.

The Verification Mindset

Effective verification requires thinking adversarially about data. Rather than assuming correctness, the engineer should:

  • Enumerate assumptions -- What must be true about the data for downstream queries to work?
  • Design falsifiable checks -- Write queries that would return unexpected results if an assumption is violated.
  • Automate where possible -- Embed verification queries in CI/CD pipelines or post-seeding scripts.

Verification in Dimensional Modeling

Dimensional data modeling introduces specific verification challenges that go beyond simple row counts.

The Actors Table

When building a cumulative actors table from the base actor_films dataset, verification queries should confirm:

  • Every actor from the source data appears in the target table.
  • The films array or nested structure contains the correct set of films for each actor.
  • The quality_class classification (e.g., star, good, average, bad) is derived correctly from the actor's average rating.
  • The is_active flag accurately reflects whether the actor had films in the current year.
-- Example: Verify actor count matches source
SELECT COUNT(DISTINCT actor) AS source_actors FROM actor_films;
SELECT COUNT(*) AS target_actors FROM actors WHERE current_year = 2021;

SCD Type 2 History

Slowly Changing Dimension (SCD) Type 2 tables track historical changes by maintaining multiple rows per entity, each with a validity period. Verification queries for SCD Type 2 tables should check:

  • No overlapping periods -- For a given actor, no two records should have overlapping start_date and end_date ranges.
  • Contiguous history -- The end_date of one record should match the start_date of the next record for the same entity.
  • Current record identification -- Exactly one record per entity should have an end_date of NULL or a sentinel value (e.g., 9999-12-31) representing the current state.
  • Change detection accuracy -- Transitions in quality_class or is_active should be accurately reflected as new SCD rows.
-- Example: Verify no overlapping SCD periods for any actor
SELECT actorid, COUNT(*)
FROM actors_history_scd a1
JOIN actors_history_scd a2
  ON a1.actorid = a2.actorid
  AND a1.start_date < a2.end_date
  AND a1.end_date > a2.start_date
  AND a1.start_date <> a2.start_date
GROUP BY actorid
HAVING COUNT(*) > 0;

Cumulative Table Generation

Cumulative tables aggregate data across time periods. Verification queries should confirm:

  • Monotonic growth -- The cumulative film count for an actor should never decrease from one year to the next.
  • Correct aggregation -- The films array for year Y should contain all films from years up to and including Y.
  • Year coverage -- Every year in the source data range should be represented in the cumulative table.
-- Example: Verify cumulative film count is non-decreasing
SELECT actor, current_year, array_length(films, 1) AS film_count
FROM actors
ORDER BY actor, current_year;

When to Apply Verification Queries

Verification queries should be executed at the following points in the workflow:

Workflow Stage Verification Focus
After database seeding Confirm base tables exist and contain expected row counts. Validate column types and constraints.
After table creation Verify new tables (e.g., actors, actors_history_scd) have the correct schema and initial data.
After cumulative generation Check that cumulative aggregations are correct and complete across all time periods.
After SCD backfill Validate historical SCD records for completeness, contiguity, and non-overlap.
Before analytical queries Run a final integrity check to ensure the data supports the intended analysis.

Best Practices

  • Start with simple counts -- A mismatch in row counts is often the fastest indicator of a seeding or transformation problem.
  • Use EXCEPT and INTERSECT -- Set operations are powerful tools for comparing expected vs. actual results.
  • Document expected values -- Record the expected output of key verification queries so that future runs can be compared.
  • Fail loudly -- In automated pipelines, verification queries should raise errors (not just warnings) when checks fail.
  • Version your verification queries -- As the schema evolves, verification queries must evolve in tandem.

Related Pages

Metadata

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment