Implementation:DataExpert io Data engineer handbook SQL Select Query Pattern
Overview
This page documents the SQL SELECT query patterns used in the Dimensional Data Modeling workflow's Week 1 homework exercises. These patterns cover table creation, cumulative table generation, SCD Type 2 history tracking, and backfill queries -- all operating against the actor_films dataset.
Type
Pattern Doc
Source
- Repository: https://github.com/DataExpert-io/data-engineer-handbook
- File:
intermediate-bootcamp/materials/1-dimensional-data-modeling/homework/homework.md(Week 1 Exercises)
Interface
All queries operate against the actor_films base table with the following schema:
| Column | Type | Description |
|---|---|---|
actor |
TEXT | The name of the actor. |
actorid |
TEXT | A unique identifier for the actor. |
film |
TEXT | The title of the film. |
year |
INTEGER | The release year of the film. |
votes |
INTEGER | The number of votes the film received. |
rating |
NUMERIC | The average rating of the film. |
filmid |
TEXT | A unique identifier for the film. |
Query Patterns
Pattern 1: CREATE TABLE -- Actors Cumulative Table
This DDL statement defines the actors table used for cumulative modeling. The table tracks each actor's film history, quality classification, and active status across years.
CREATE TABLE actors (
actor TEXT,
actorid TEXT,
films film_info[], -- Array of composite type holding film details
quality_class quality_rating, -- Enum: 'star', 'good', 'average', 'bad'
is_active BOOLEAN, -- Whether the actor had films in the current year
current_year INTEGER, -- The year this row represents
PRIMARY KEY (actorid, current_year)
);
Key design decisions:
- The films column uses a PostgreSQL array of a composite type (
film_info) to store the full list of an actor's films cumulatively. - The quality_class column uses a custom enum type derived from the actor's average rating across all their films.
- The primary key is a composite of
actoridandcurrent_year, enabling one row per actor per year.
Pattern 2: Cumulative Table Generation
This pattern populates the actors table by iterating year-by-year, carrying forward the cumulative film list and recomputing derived fields.
INSERT INTO actors
WITH yesterday AS (
SELECT * FROM actors
WHERE current_year = 2000 -- Previous year
),
today AS (
SELECT
actor,
actorid,
ARRAY_AGG(ROW(film, votes, rating, filmid)::film_info) AS films,
AVG(rating) AS avg_rating,
year
FROM actor_films
WHERE year = 2001 -- Current year
GROUP BY actor, actorid, year
)
SELECT
COALESCE(t.actor, y.actor) AS actor,
COALESCE(t.actorid, y.actorid) AS actorid,
CASE
WHEN y.films IS NULL THEN t.films
WHEN t.films IS NOT NULL THEN y.films || t.films
ELSE y.films
END AS films,
CASE
WHEN t.avg_rating > 8 THEN 'star'
WHEN t.avg_rating > 7 THEN 'good'
WHEN t.avg_rating > 6 THEN 'average'
ELSE 'bad'
END::quality_rating AS quality_class,
t.year IS NOT NULL AS is_active,
COALESCE(t.year, y.current_year + 1) AS current_year
FROM today t
FULL OUTER JOIN yesterday y
ON t.actorid = y.actorid;
How it works:
- The yesterday CTE retrieves the previous year's cumulative state from the
actorstable. - The today CTE aggregates the current year's new films from
actor_films. - A FULL OUTER JOIN merges both sets, handling actors who appear in only one of the two years.
- The films array is extended by concatenating yesterday's array with today's new entries.
- The quality_class is recomputed based on the current year's average rating.
- The is_active flag is set based on whether the actor had films in the current year.
Pattern 3: Actors History SCD (Type 2)
This pattern creates and populates the actors_history_scd table, which tracks changes in quality_class and is_active over time using SCD Type 2 methodology.
CREATE TABLE actors_history_scd (
actorid TEXT,
actor TEXT,
quality_class quality_rating,
is_active BOOLEAN,
start_date INTEGER, -- Start year of this SCD period
end_date INTEGER, -- End year of this SCD period
current_year INTEGER,
PRIMARY KEY (actorid, start_date)
);
The SCD generation query uses change detection logic:
WITH streak_identification AS (
SELECT
actorid,
actor,
quality_class,
is_active,
current_year,
SUM(change_flag) OVER (
PARTITION BY actorid ORDER BY current_year
) AS streak_id
FROM (
SELECT
actorid,
actor,
quality_class,
is_active,
current_year,
CASE
WHEN quality_class <> LAG(quality_class) OVER (
PARTITION BY actorid ORDER BY current_year
) OR is_active <> LAG(is_active) OVER (
PARTITION BY actorid ORDER BY current_year
) OR LAG(quality_class) OVER (
PARTITION BY actorid ORDER BY current_year
) IS NULL
THEN 1
ELSE 0
END AS change_flag
FROM actors
) flagged
)
SELECT
actorid,
actor,
quality_class,
is_active,
MIN(current_year) AS start_date,
MAX(current_year) AS end_date,
2021 AS current_year -- The processing year
FROM streak_identification
GROUP BY actorid, actor, quality_class, is_active, streak_id;
How it works:
- The inner subquery uses
LAG()to detect whenquality_classoris_activechanges from one year to the next. - Each change increments a
change_flag, which is then cumulatively summed to create astreak_id. - The outer query groups by
streak_idto collapse consecutive unchanged years into a single SCD row withstart_dateandend_date.
Pattern 4: Backfill Query
Backfill queries populate the cumulative actors table across the entire historical range in a single operation, rather than processing one year at a time.
-- Generate the full year range
WITH years AS (
SELECT generate_series(
(SELECT MIN(year) FROM actor_films),
(SELECT MAX(year) FROM actor_films)
) AS year
),
-- Recursive or iterative cumulative build
cumulative AS (
SELECT
af.actorid,
af.actor,
y.year AS current_year,
ARRAY_AGG(
ROW(af.film, af.votes, af.rating, af.filmid)::film_info
) FILTER (WHERE af.year <= y.year) AS films
FROM actor_films af
CROSS JOIN years y
WHERE af.year <= y.year
GROUP BY af.actorid, af.actor, y.year
)
SELECT * FROM cumulative;
Purpose: Backfill queries are used to initialize the cumulative table from scratch when the table is empty or needs to be rebuilt. They compute the full historical state in one pass rather than requiring iterative year-by-year insertion.
Examples Summary
| Pattern | Purpose | Key SQL Features |
|---|---|---|
| CREATE TABLE actors | Define cumulative table schema | Composite types, arrays, enums, composite PK |
| Cumulative generation | Year-over-year film accumulation | CTEs, FULL OUTER JOIN, ARRAY_AGG, COALESCE |
| SCD Type 2 history | Track attribute changes over time | LAG(), window functions, change detection, streak grouping |
| Backfill | Initialize full history in one pass | generate_series, CROSS JOIN, FILTER clause |
Related Pages
- Principle:DataExpert_io_Data_engineer_handbook_SQL_Verification_Queries -- The verification principles that these query patterns help implement.
- Implementation:DataExpert_io_Data_engineer_handbook_Pg_restore_Init_Script -- The initialization script that seeds the base actor_films data these queries operate on.
- Principle:DataExpert_io_Data_engineer_handbook_Database_Seeding -- The broader seeding process that provides the foundational dataset.
- Environment:DataExpert_io_Data_engineer_handbook_PostgreSQL_Docker_Environment
Metadata
- Knowledge Sources: Data Engineer Handbook
- Domains: Data_Engineering, SQL, Infrastructure
- Last Updated: 2026-02-09 06:00 GMT