Principle:DataTalksClub Data engineering zoomcamp Dbt Intermediate Layer

From Leeroopedia


Page Metadata
Knowledge Sources: dbt intermediate layer best practices, analytics engineering methodology
Domains: Analytics Engineering, Data Modeling, Intermediate Layer Design
Last Updated: 2026-02-09 14:00 GMT

Overview

The intermediate transformation layer combines, deduplicates, and enriches data from staging models, serving as the bridge between raw-standardized data and business-facing marts.

Description

In a layered transformation architecture, the intermediate layer occupies the space between staging (which only standardizes) and marts (which only serve business needs). Intermediate models perform the heavy lifting of data integration:

  • Union operations: Combining multiple staging models that represent the same entity from different source systems (e.g., green and yellow taxi trips) into a single, unified dataset with a discriminator column (e.g., service_type).
  • Schema harmonization: When source schemas differ (e.g., green taxis have trip_type and ehail_fee but yellow taxis do not), intermediate models fill missing columns with sensible defaults to produce a uniform schema.
  • Surrogate key generation: Creating stable, unique identifiers from business key combinations using deterministic hashing, which serves as the primary key for downstream fact tables.
  • Deduplication: Removing duplicate records using window functions (e.g., QUALIFY ROW_NUMBER()) partitioned by the same business key combination used for surrogate key generation.
  • Enrichment via seed lookups: Joining with seed data (static reference tables loaded from CSV files) to add descriptive labels such as payment type descriptions.

Intermediate models are materialized as tables because they persist the results of potentially expensive union and deduplication operations, preventing downstream models from re-computing these transformations on every query.

Usage

Use the intermediate layer pattern when:

  • Multiple staging models need to be combined into a single dataset.
  • Source schemas differ and require harmonization before union.
  • Surrogate keys must be generated from composite business keys.
  • Deduplication logic is needed to ensure row-level uniqueness.
  • Static reference data (seeds) needs to be joined for enrichment.
  • The same intermediate result is consumed by multiple marts models.

Theoretical Basis

Union with Schema Harmonization

When combining datasets with different schemas, the intermediate layer must map each source to a common set of columns:

function union_with_harmonization(sources[]):
    target_schema = union_of_all_columns(sources)

    for each source in sources:
        for each column in target_schema:
            if column exists in source:
                select column
            else:
                select DEFAULT_VALUE as column  -- e.g., cast(1 as integer) as trip_type
        add discriminator column  -- e.g., 'Green' as service_type

    return UNION ALL of all transformed sources

This pattern preserves all information from each source while producing a single, queryable dataset.
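The union logic above can be sketched in Python over rows represented as dicts; the column names, defaults, and discriminator values below are illustrative, not taken from any particular project:

```python
def union_with_harmonization(sources, defaults):
    """Union row sets with differing schemas into one uniform dataset.

    sources:  mapping of discriminator value (e.g., 'Green') -> list of row dicts
    defaults: fallback value per column for sources that lack that column
    """
    # Target schema is the union of every column seen in any source.
    target_schema = sorted({col for rows in sources.values()
                            for row in rows for col in row})
    unified = []
    for service_type, rows in sources.items():
        for row in rows:
            # Fill columns missing from this source with the configured default.
            out = {col: row.get(col, defaults.get(col)) for col in target_schema}
            out["service_type"] = service_type  # discriminator column
            unified.append(out)
    return unified

# Hypothetical green/yellow taxi rows: green has trip_type and ehail_fee,
# yellow does not, so those columns are filled from defaults.
green = [{"vendor_id": 1, "trip_type": 2, "ehail_fee": 0.5}]
yellow = [{"vendor_id": 2}]
trips = union_with_harmonization(
    {"Green": green, "Yellow": yellow},
    defaults={"trip_type": 1, "ehail_fee": 0.0},
)
```

Every output row carries the same set of columns plus the discriminator, which is what lets downstream marts query the combined dataset without caring which source system a row came from.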

Surrogate Key Generation

Surrogate keys are deterministic hashes of business key columns, providing stable identifiers independent of source system IDs:

function generate_surrogate_key(columns[]):
    -- Concatenate column values with a separator
    concatenated = join(columns, separator='||')
    -- Apply deterministic hash (e.g., MD5)
    return hash(concatenated)

-- Example: trip_id = hash(vendor_id || '||' || pickup_datetime || '||' || pickup_location_id || '||' || service_type)

The key property of surrogate keys is idempotency: the same input row always produces the same surrogate key, enabling incremental processing and deduplication.
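A minimal Python sketch of the separator-concatenate-hash approach (the business-key values below are hypothetical; real dbt projects typically delegate this to a macro such as dbt_utils.generate_surrogate_key):

```python
import hashlib

def generate_surrogate_key(*values):
    """Deterministic MD5 hash of business-key values joined with '||'.

    None is normalized to an empty string so that the key stays stable
    across runs even when a component is missing.
    """
    concatenated = "||".join("" if v is None else str(v) for v in values)
    return hashlib.md5(concatenated.encode("utf-8")).hexdigest()

# Idempotency: identical business-key inputs always produce the same key.
key_a = generate_surrogate_key(1, "2021-01-01 00:15:00", 43, "Green")
key_b = generate_surrogate_key(1, "2021-01-01 00:15:00", 43, "Green")
```

Because the key is a pure function of the input columns, re-running the model over the same rows regenerates the same identifiers, which is what makes incremental loads and deduplication safe.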

Deduplication with QUALIFY

The QUALIFY clause (supported by modern SQL engines such as BigQuery, Snowflake, and DuckDB) filters window function results without requiring a wrapping subquery:

function deduplicate(dataset, partition_columns, order_column):
    return:
        SELECT *
        FROM dataset
        QUALIFY ROW_NUMBER() OVER(
            PARTITION BY partition_columns
            ORDER BY order_column
        ) = 1

This pattern keeps the first record (by the ordering column) for each unique combination of partition columns, effectively removing duplicates while preserving the most relevant record.
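The keep-first-per-partition semantics can be sketched in Python (the column names here are hypothetical):

```python
def deduplicate(rows, partition_cols, order_col):
    """Keep one row per partition-key combination: the first by order_col.

    Mirrors QUALIFY ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) = 1.
    """
    best = {}
    for row in rows:
        key = tuple(row[c] for c in partition_cols)
        # Replace the kept row only if this one sorts earlier.
        if key not in best or row[order_col] < best[key][order_col]:
            best[key] = row
    return list(best.values())

# Two rows share the business key (vendor_id, pickup_datetime); the one
# with the earlier loaded_at is kept, the duplicate is dropped.
rows = [
    {"vendor_id": 1, "pickup_datetime": "t1", "loaded_at": "2021-01-02"},
    {"vendor_id": 1, "pickup_datetime": "t1", "loaded_at": "2021-01-01"},
    {"vendor_id": 2, "pickup_datetime": "t2", "loaded_at": "2021-01-01"},
]
deduped = deduplicate(rows, ["vendor_id", "pickup_datetime"], "loaded_at")
```

Note that the partition columns must match the columns used for surrogate key generation, so that exactly one row survives per surrogate key.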

Enrichment via Seed Joins

Seeds are static lookup tables loaded into the warehouse from CSV files (via dbt seed). Enrichment joins these lookups to add human-readable labels:

function enrich_with_seed(dataset, seed_table, join_key):
    return:
        SELECT
            d.*,
            COALESCE(s.description, 'Unknown') as enriched_column
        FROM dataset d
        LEFT JOIN seed_table s ON d.join_key = s.join_key

The LEFT JOIN with COALESCE ensures that records without a matching seed entry still appear in the output with a default value.
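The left-join-with-default behavior can be sketched in Python; the payment-type rows and seed contents below are illustrative:

```python
def enrich_with_seed(rows, seed, join_key, label_col, default="Unknown"):
    """LEFT JOIN-style lookup: attach the seed's label, or a default.

    Mirrors LEFT JOIN ... + COALESCE(s.description, 'Unknown').
    """
    lookup = {s[join_key]: s[label_col] for s in seed}
    return [{**row, label_col: lookup.get(row[join_key], default)}
            for row in rows]

# payment_type 99 has no seed entry, so it falls back to 'Unknown'
# instead of being dropped, as a LEFT JOIN (not an INNER JOIN) guarantees.
payments = [{"trip_id": "a", "payment_type": 1},
            {"trip_id": "b", "payment_type": 99}]
seed = [{"payment_type": 1, "description": "Credit card"},
        {"payment_type": 2, "description": "Cash"}]
enriched = enrich_with_seed(payments, seed, "payment_type", "description")
```

Using an inner join here would silently drop unmatched trips, which is why the pattern pairs LEFT JOIN with a COALESCE default instead.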
