Principle:DataTalksClub Data engineering zoomcamp Dbt Intermediate Layer
| Page Metadata | |
|---|---|
| Knowledge Sources | dbt intermediate layer best practices, analytics engineering methodology |
| Domains | Analytics Engineering, Data Modeling, Intermediate Layer Design |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
The intermediate transformation layer combines, deduplicates, and enriches data from staging models, serving as the bridge between raw-standardized data and business-facing marts.
Description
In a layered transformation architecture, the intermediate layer occupies the space between staging (which only standardizes) and marts (which only serve business needs). Intermediate models perform the heavy lifting of data integration:
- Union operations: Combining multiple staging models that represent the same entity from different source systems (e.g., green and yellow taxi trips) into a single, unified dataset with a discriminator column (e.g., service_type).
- Schema harmonization: When source schemas differ (e.g., green taxis have trip_type and ehail_fee but yellow taxis do not), intermediate models fill missing columns with sensible defaults to produce a uniform schema.
- Surrogate key generation: Creating stable, unique identifiers from business key combinations using deterministic hashing; these keys serve as the primary keys for downstream fact tables.
- Deduplication: Removing duplicate records using window functions (e.g., QUALIFY ROW_NUMBER()) partitioned by the same business key combination used for surrogate key generation.
- Enrichment via seed lookups: Joining with seed data (static reference tables loaded from CSV files) to add descriptive labels such as payment type descriptions.
Intermediate models are materialized as tables because they persist the results of potentially expensive union and deduplication operations, preventing downstream models from re-computing these transformations on every query.
Usage
Use the intermediate layer pattern when:
- Multiple staging models need to be combined into a single dataset.
- Source schemas differ and require harmonization before union.
- Surrogate keys must be generated from composite business keys.
- Deduplication logic is needed to ensure row-level uniqueness.
- Static reference data (seeds) needs to be joined for enrichment.
- The same intermediate result is consumed by multiple marts models.
Theoretical Basis
Union with Schema Harmonization
When combining datasets with different schemas, the intermediate layer must map each source to a common set of columns:
```
function union_with_harmonization(sources[]):
    target_schema = union_of_all_columns(sources)
    for each source in sources:
        for each column in target_schema:
            if column exists in source:
                select column
            else:
                select DEFAULT_VALUE as column  -- e.g., cast(1 as integer) as trip_type
        add discriminator column  -- e.g., 'Green' as service_type
    return UNION ALL of all transformed sources
```
This pattern preserves all information from each source while producing a single, queryable dataset.
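The pattern above can be sketched concretely with an in-memory SQLite database standing in for the warehouse. The table and column names here (stg_green, stg_yellow, trip_type) are illustrative stand-ins for the staging models, not the course's exact schemas:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two staging tables with differing schemas: green has trip_type, yellow does not.
cur.execute("CREATE TABLE stg_green (vendor_id INT, fare REAL, trip_type INT)")
cur.execute("CREATE TABLE stg_yellow (vendor_id INT, fare REAL)")
cur.execute("INSERT INTO stg_green VALUES (1, 10.5, 2)")
cur.execute("INSERT INTO stg_yellow VALUES (2, 7.0)")

# Harmonize: fill the missing trip_type with a default, add a discriminator
# column, and UNION ALL into one uniform schema.
rows = sorted(cur.execute("""
    SELECT vendor_id, fare, trip_type, 'Green' AS service_type
    FROM stg_green
    UNION ALL
    SELECT vendor_id, fare, CAST(1 AS INTEGER) AS trip_type, 'Yellow' AS service_type
    FROM stg_yellow
""").fetchall())

print(rows)  # every row now carries the same columns plus its service_type
```

In a dbt project the same shape would be expressed as two SELECTs over `ref()`'d staging models, but the column-filling and discriminator logic is identical.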
Surrogate Key Generation
Surrogate keys are deterministic hashes of business key columns, providing stable identifiers independent of source system IDs:
```
function generate_surrogate_key(columns[]):
    -- Concatenate column values with a separator
    concatenated = join(columns, separator='||')
    -- Apply deterministic hash (e.g., MD5)
    return hash(concatenated)

-- Example: trip_id = hash(vendor_id || pickup_datetime || pickup_location_id || service_type)
```
The key property of surrogate keys is determinism: the same input row always produces the same surrogate key, enabling incremental processing and deduplication.
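A minimal Python sketch of this hashing scheme follows (dbt projects typically delegate this to a macro such as dbt_utils' surrogate key helper; the separator and null handling below are illustrative choices, not the macro's exact behavior):

```python
import hashlib

def generate_surrogate_key(*values, separator="||"):
    """Deterministically hash a business-key combination into a surrogate key.

    None values are rendered as empty strings before hashing, so the key is
    still defined when part of the business key is missing.
    """
    concatenated = separator.join("" if v is None else str(v) for v in values)
    return hashlib.md5(concatenated.encode("utf-8")).hexdigest()

# Same business key in, same surrogate key out -- every run, every machine.
k1 = generate_surrogate_key(1, "2021-01-01 00:15:00", 43, "Green")
k2 = generate_surrogate_key(1, "2021-01-01 00:15:00", 43, "Green")
assert k1 == k2
```

Because the key is a pure function of the business columns, re-processing the same source row in an incremental run regenerates the identical key rather than minting a new one.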
Deduplication with QUALIFY
The QUALIFY clause (supported by modern SQL engines) filters window function results without requiring a wrapping subquery:
```
function deduplicate(dataset, partition_columns, order_column):
    return:
        SELECT *
        FROM dataset
        QUALIFY ROW_NUMBER() OVER (
            PARTITION BY partition_columns
            ORDER BY order_column
        ) = 1
```
This pattern keeps the first record (by the ordering column) for each unique combination of partition columns, effectively removing duplicates while preserving the most relevant record.
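The same deduplication can be demonstrated with SQLite, which supports ROW_NUMBER() but not QUALIFY, so the window function is filtered in a wrapping subquery instead; engines such as BigQuery, Snowflake, or DuckDB accept the more compact QUALIFY form directly. The table and columns are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE trips (vendor_id INT, pickup TEXT, fare REAL)")
cur.executemany("INSERT INTO trips VALUES (?, ?, ?)", [
    (1, "2021-01-01", 10.0),   # duplicate business key ...
    (1, "2021-01-01", 10.0),   # ... this copy is dropped
    (2, "2021-01-02", 7.5),
])

# Keep the first row per (vendor_id, pickup); equivalent to the QUALIFY form.
rows = sorted(cur.execute("""
    SELECT vendor_id, pickup, fare FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY vendor_id, pickup
            ORDER BY pickup
        ) AS rn
        FROM trips
    ) WHERE rn = 1
""").fetchall())

print(rows)  # one row per business key
```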
Enrichment via Seed Joins
Seeds are static lookup tables loaded from CSV files. Enrichment joins these lookups to add human-readable labels:
```
function enrich_with_seed(dataset, seed_table, join_key):
    return:
        SELECT
            d.*,
            COALESCE(s.description, 'Unknown') AS enriched_column
        FROM dataset d
        LEFT JOIN seed_table s ON d.join_key = s.join_key
```
The LEFT JOIN with COALESCE ensures that records without a matching seed entry still appear in the output with a default value.
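A small SQLite sketch of the enrichment join, with a hypothetical payment-type lookup standing in for a dbt seed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE trips (trip_id INT, payment_type INT)")
cur.execute("CREATE TABLE payment_type_lookup (payment_type INT, description TEXT)")
cur.executemany("INSERT INTO trips VALUES (?, ?)", [(1, 1), (2, 99)])  # 99 has no lookup entry
cur.executemany("INSERT INTO payment_type_lookup VALUES (?, ?)",
                [(1, "Credit card"), (2, "Cash")])

# LEFT JOIN keeps every trip; COALESCE supplies a default for unmatched codes.
rows = sorted(cur.execute("""
    SELECT t.trip_id, COALESCE(l.description, 'Unknown') AS payment_description
    FROM trips t
    LEFT JOIN payment_type_lookup l ON t.payment_type = l.payment_type
""").fetchall())

print(rows)  # the unmatched payment code falls back to 'Unknown'
```

In dbt the lookup table would be a CSV seed referenced with `ref()`, but the join and default-handling shape is the same.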
Related Pages
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Dbt_Intermediate_Models
- Principle:DataTalksClub_Data_engineering_zoomcamp_Dbt_Staging_Layer
- Principle:DataTalksClub_Data_engineering_zoomcamp_Dbt_Marts_Layer
- Principle:DataTalksClub_Data_engineering_zoomcamp_Dbt_Project_Configuration
- Heuristic:DataTalksClub_Data_engineering_zoomcamp_DuckDB_OOM_Memory_Management