Implementation:DataExpert io Data engineer handbook SQL Select Query Pattern

Overview

This page documents the SQL SELECT query patterns used in the Dimensional Data Modeling workflow's Week 1 homework exercises. These patterns cover table creation, cumulative table generation, SCD Type 2 history tracking, and backfill queries -- all operating against the actor_films dataset.

Type

Pattern Doc

Source

Repository: https://github.com/DataExpert-io/data-engineer-handbook
File: intermediate-bootcamp/materials/1-dimensional-data-modeling/homework/homework.md (Week 1 Exercises)

Interface

All queries operate against the actor_films base table with the following schema:

Column	Type	Description
`actor`	TEXT	The name of the actor.
`actorid`	TEXT	A unique identifier for the actor.
`film`	TEXT	The title of the film.
`year`	INTEGER	The release year of the film.
`votes`	INTEGER	The number of votes the film received.
`rating`	NUMERIC	The average rating of the film.
`filmid`	TEXT	A unique identifier for the film.

Query Patterns

Pattern 1: CREATE TABLE -- Actors Cumulative Table

This DDL statement defines the actors table used for cumulative modeling. The table tracks each actor's film history, quality classification, and active status across years.

CREATE TABLE actors (
    actor         TEXT,
    actorid       TEXT,
    films         film_info[],     -- Array of composite type holding film details
    quality_class quality_rating,  -- Enum: 'star', 'good', 'average', 'bad'
    is_active     BOOLEAN,         -- Whether the actor had films in the current year
    current_year  INTEGER,         -- The year this row represents
    PRIMARY KEY (actorid, current_year)
);

Key design decisions:

The films column uses a PostgreSQL array of a composite type (film_info) to store the full list of an actor's films cumulatively.
The quality_class column uses a custom enum type derived from the actor's average rating across all their films.
The primary key is a composite of actorid and current_year, enabling one row per actor per year.

Pattern 2: Cumulative Table Generation

This pattern populates the actors table by iterating year-by-year, carrying forward the cumulative film list and recomputing derived fields.

INSERT INTO actors
WITH yesterday AS (
    SELECT * FROM actors
    WHERE current_year = 2000  -- Previous year
),
today AS (
    SELECT
        actor,
        actorid,
        ARRAY_AGG(ROW(film, votes, rating, filmid)::film_info) AS films,
        AVG(rating) AS avg_rating,
        year
    FROM actor_films
    WHERE year = 2001  -- Current year
    GROUP BY actor, actorid, year
)
SELECT
    COALESCE(t.actor, y.actor) AS actor,
    COALESCE(t.actorid, y.actorid) AS actorid,
    CASE
        WHEN y.films IS NULL THEN t.films
        WHEN t.films IS NOT NULL THEN y.films || t.films
        ELSE y.films
    END AS films,
    CASE
        WHEN t.avg_rating > 8 THEN 'star'
        WHEN t.avg_rating > 7 THEN 'good'
        WHEN t.avg_rating > 6 THEN 'average'
        ELSE 'bad'
    END::quality_rating AS quality_class,
    t.year IS NOT NULL AS is_active,
    COALESCE(t.year, y.current_year + 1) AS current_year
FROM today t
FULL OUTER JOIN yesterday y
    ON t.actorid = y.actorid;

How it works:

The yesterday CTE retrieves the previous year's cumulative state from the actors table.
The today CTE aggregates the current year's new films from actor_films.
A FULL OUTER JOIN merges both sets, handling actors who appear in only one of the two years.
The films array is extended by concatenating yesterday's array with today's new entries.
The quality_class is recomputed based on the current year's average rating.
The is_active flag is set based on whether the actor had films in the current year.

Pattern 3: Actors History SCD (Type 2)

This pattern creates and populates the actors_history_scd table, which tracks changes in quality_class and is_active over time using SCD Type 2 methodology.

CREATE TABLE actors_history_scd (
    actorid       TEXT,
    actor         TEXT,
    quality_class quality_rating,
    is_active     BOOLEAN,
    start_date    INTEGER,  -- Start year of this SCD period
    end_date      INTEGER,  -- End year of this SCD period
    current_year  INTEGER,
    PRIMARY KEY (actorid, start_date)
);

The SCD generation query uses change detection logic:

WITH streak_identification AS (
    SELECT
        actorid,
        actor,
        quality_class,
        is_active,
        current_year,
        SUM(change_flag) OVER (
            PARTITION BY actorid ORDER BY current_year
        ) AS streak_id
    FROM (
        SELECT
            actorid,
            actor,
            quality_class,
            is_active,
            current_year,
            CASE
                WHEN quality_class <> LAG(quality_class) OVER (
                    PARTITION BY actorid ORDER BY current_year
                ) OR is_active <> LAG(is_active) OVER (
                    PARTITION BY actorid ORDER BY current_year
                ) OR LAG(quality_class) OVER (
                    PARTITION BY actorid ORDER BY current_year
                ) IS NULL
                THEN 1
                ELSE 0
            END AS change_flag
        FROM actors
    ) flagged
)
SELECT
    actorid,
    actor,
    quality_class,
    is_active,
    MIN(current_year) AS start_date,
    MAX(current_year) AS end_date,
    2021 AS current_year  -- The processing year
FROM streak_identification
GROUP BY actorid, actor, quality_class, is_active, streak_id;

How it works:

The inner subquery uses LAG() to detect when quality_class or is_active changes from one year to the next.
Each change increments a change_flag, which is then cumulatively summed to create a streak_id.
The outer query groups by streak_id to collapse consecutive unchanged years into a single SCD row with start_date and end_date.

Pattern 4: Backfill Query

Backfill queries populate the cumulative actors table across the entire historical range in a single operation, rather than processing one year at a time.

-- Generate the full year range
WITH years AS (
    SELECT generate_series(
        (SELECT MIN(year) FROM actor_films),
        (SELECT MAX(year) FROM actor_films)
    ) AS year
),
-- Recursive or iterative cumulative build
cumulative AS (
    SELECT
        af.actorid,
        af.actor,
        y.year AS current_year,
        ARRAY_AGG(
            ROW(af.film, af.votes, af.rating, af.filmid)::film_info
        ) FILTER (WHERE af.year <= y.year) AS films
    FROM actor_films af
    CROSS JOIN years y
    WHERE af.year <= y.year
    GROUP BY af.actorid, af.actor, y.year
)
SELECT * FROM cumulative;

Purpose: Backfill queries are used to initialize the cumulative table from scratch when the table is empty or needs to be rebuilt. They compute the full historical state in one pass rather than requiring iterative year-by-year insertion.

Examples Summary

Pattern	Purpose	Key SQL Features
CREATE TABLE actors	Define cumulative table schema	Composite types, arrays, enums, composite PK
Cumulative generation	Year-over-year film accumulation	CTEs, FULL OUTER JOIN, ARRAY_AGG, COALESCE
SCD Type 2 history	Track attribute changes over time	LAG(), window functions, change detection, streak grouping
Backfill	Initialize full history in one pass	generate_series, CROSS JOIN, FILTER clause

Related Pages

Principle:DataExpert_io_Data_engineer_handbook_SQL_Verification_Queries -- The verification principles that these query patterns help implement.
Implementation:DataExpert_io_Data_engineer_handbook_Pg_restore_Init_Script -- The initialization script that seeds the base actor_films data these queries operate on.
Principle:DataExpert_io_Data_engineer_handbook_Database_Seeding -- The broader seeding process that provides the foundational dataset.
Environment:DataExpert_io_Data_engineer_handbook_PostgreSQL_Docker_Environment

Metadata

Knowledge Sources: Data Engineer Handbook
Domains: Data_Engineering, SQL, Infrastructure
Last Updated: 2026-02-09 06:00 GMT

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment