Implementation:DataTalksClub Data engineering zoomcamp Dbt Intermediate Models
| Page Metadata | |
|---|---|
| Knowledge Sources | repo: DataTalksClub/data-engineering-zoomcamp, dbt docs: dbt intermediate layer reference |
| Domains | Analytics Engineering, Intermediate Models, Data Integration |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete dbt models implementing the intermediate transformation layer, combining green and yellow taxi staging data via UNION ALL, generating surrogate keys, deduplicating records, and enriching with payment type descriptions from seed data.
Description
The intermediate layer consists of two models that progressively integrate and clean the staged taxi trip data:
int_trips_unioned(20 output columns): Unionsstg_green_tripdataandstg_yellow_tripdatainto a single dataset. Harmonizes schema differences by adding default values for columns missing in yellow taxi data (trip_type=1,ehail_fee=0). Adds aservice_typediscriminator column ('Green'or'Yellow').
int_trips(21 output columns): Generates a surrogatetrip_idfrom business keys usingdbt_utils.generate_surrogate_key, joins with thepayment_type_lookupseed for enrichment, and deduplicates usingQUALIFY ROW_NUMBER().
Usage
These models are consumed by the marts layer via Template:Ref('int trips'). They should be rebuilt whenever staging data changes. Run with dbt run --select intermediate.
Code Reference
Source Location
04-analytics-engineering/taxi_rides_ny/models/intermediate/int_trips_unioned.sql(Lines 1-56)04-analytics-engineering/taxi_rides_ny/models/intermediate/int_trips.sql(Lines 1-62)04-analytics-engineering/taxi_rides_ny/models/intermediate/schema.yml(Lines 1-107)
Signature: int_trips_unioned
-- Union green and yellow taxi data into a single dataset
-- Demonstrates how to combine data from multiple sources with slightly different schemas
with green_trips as (
select
vendor_id,
rate_code_id,
pickup_location_id,
dropoff_location_id,
pickup_datetime,
dropoff_datetime,
store_and_fwd_flag,
passenger_count,
trip_distance,
trip_type,
fare_amount,
extra,
mta_tax,
tip_amount,
tolls_amount,
ehail_fee,
improvement_surcharge,
total_amount,
payment_type,
'Green' as service_type
from {{ ref('stg_green_tripdata') }}
),
yellow_trips as (
select
vendor_id,
rate_code_id,
pickup_location_id,
dropoff_location_id,
pickup_datetime,
dropoff_datetime,
store_and_fwd_flag,
passenger_count,
trip_distance,
cast(1 as integer) as trip_type, -- Yellow taxis only do street-hail (code 1)
fare_amount,
extra,
mta_tax,
tip_amount,
tolls_amount,
cast(0 as numeric) as ehail_fee, -- Yellow taxis don't have ehail_fee
improvement_surcharge,
total_amount,
payment_type,
'Yellow' as service_type
from {{ ref('stg_yellow_tripdata') }}
)
select * from green_trips
union all
select * from yellow_trips
Signature: int_trips
-- Enrich and deduplicate trip data
-- Demonstrates enrichment and surrogate key generation
with unioned as (
select * from {{ ref('int_trips_unioned') }}
),
payment_types as (
select * from {{ ref('payment_type_lookup') }}
),
cleaned_and_enriched as (
select
-- Generate unique trip identifier (surrogate key pattern)
{{ dbt_utils.generate_surrogate_key(
['u.vendor_id', 'u.pickup_datetime', 'u.pickup_location_id', 'u.service_type']
) }} as trip_id,
-- Identifiers
u.vendor_id,
u.service_type,
u.rate_code_id,
-- Location IDs
u.pickup_location_id,
u.dropoff_location_id,
-- Timestamps
u.pickup_datetime,
u.dropoff_datetime,
-- Trip details
u.store_and_fwd_flag,
u.passenger_count,
u.trip_distance,
u.trip_type,
-- Payment breakdown
u.fare_amount,
u.extra,
u.mta_tax,
u.tip_amount,
u.tolls_amount,
u.ehail_fee,
u.improvement_surcharge,
u.total_amount,
-- Enrich with payment type description
coalesce(u.payment_type, 0) as payment_type,
coalesce(pt.description, 'Unknown') as payment_type_description
from unioned u
left join payment_types pt
on coalesce(u.payment_type, 0) = pt.payment_type
)
select * from cleaned_and_enriched
-- Deduplicate: if multiple trips match (same vendor, second, location, service), keep first
qualify row_number() over(
partition by vendor_id, pickup_datetime, pickup_location_id, service_type
order by dropoff_datetime
) = 1
Import
External dependency required for surrogate key generation:
# packages.yml
packages:
- package: dbt-labs/dbt_utils
version: [">=1.3.0", "<2.0.0"]
dbt deps
Materialization is inherited from dbt_project.yml:
models:
taxi_rides_ny:
intermediate:
+materialized: table
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
Template:Ref('stg green tripdata') |
Staging view | Standardized green taxi trips (20 columns) |
Template:Ref('stg yellow tripdata') |
Staging view | Standardized yellow taxi trips (18 columns) |
Template:Ref('payment type lookup') |
Seed table | Payment type codes mapped to descriptions (payment_type, description) |
dbt_utils.generate_surrogate_key |
Macro (dbt_utils) | Generates deterministic MD5 hash from input columns |
Outputs
| Output | Type | Columns | Description |
|---|---|---|---|
int_trips_unioned |
Table (20 columns) | vendor_id, rate_code_id, pickup_location_id, dropoff_location_id, pickup_datetime, dropoff_datetime, store_and_fwd_flag, passenger_count, trip_distance, trip_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, service_type | Unified green and yellow taxi trips with harmonized schema |
int_trips |
Table (21 columns) | trip_id, vendor_id, service_type, rate_code_id, pickup_location_id, dropoff_location_id, pickup_datetime, dropoff_datetime, store_and_fwd_flag, passenger_count, trip_distance, trip_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, payment_type_description | Deduplicated, enriched trip data with surrogate key and payment descriptions |
Usage Examples
Building the intermediate layer
# Build only intermediate models
dbt run --select intermediate
# Build intermediate and all upstream dependencies
dbt run --select +intermediate
# Test intermediate models (checks unique, not_null, accepted_values on int_trips)
dbt test --select intermediate
Schema harmonization pattern for UNION ALL
-- Yellow taxis lack trip_type and ehail_fee columns.
-- The intermediate layer fills these with sensible defaults:
yellow_trips as (
select
vendor_id,
-- ... shared columns ...
cast(1 as integer) as trip_type, -- Yellow taxis only do street-hail (code 1)
-- ... shared columns ...
cast(0 as numeric) as ehail_fee, -- Yellow taxis don't have ehail_fee
-- ... remaining columns ...
'Yellow' as service_type
from {{ ref('stg_yellow_tripdata') }}
)
Surrogate key and deduplication pattern
-- Generate a deterministic surrogate key from business columns:
{{ dbt_utils.generate_surrogate_key(
['u.vendor_id', 'u.pickup_datetime', 'u.pickup_location_id', 'u.service_type']
) }} as trip_id
-- Then deduplicate using the same partition columns:
qualify row_number() over(
partition by vendor_id, pickup_datetime, pickup_location_id, service_type
order by dropoff_datetime
) = 1
Related Pages
- Principle:DataTalksClub_Data_engineering_zoomcamp_Dbt_Intermediate_Layer
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Dbt_Staging_Models
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Dbt_Marts_Models
- Heuristic:DataTalksClub_Data_engineering_zoomcamp_DuckDB_OOM_Memory_Management
- Environment:DataTalksClub_Data_engineering_zoomcamp_Dbt_DuckDB_Environment