Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Implementation:DataTalksClub Data engineering zoomcamp Dbt Intermediate Models

From Leeroopedia


Page Metadata
Knowledge Sources repo: DataTalksClub/data-engineering-zoomcamp, dbt docs: dbt intermediate layer reference
Domains Analytics Engineering, Intermediate Models, Data Integration
Last Updated 2026-02-09 14:00 GMT

Overview

Concrete dbt models implementing the intermediate transformation layer, combining green and yellow taxi staging data via UNION ALL, generating surrogate keys, deduplicating records, and enriching with payment type descriptions from seed data.

Description

The intermediate layer consists of two models that progressively integrate and clean the staged taxi trip data:

  • int_trips_unioned (20 output columns): Unions stg_green_tripdata and stg_yellow_tripdata into a single dataset. Harmonizes schema differences by adding default values for columns missing in yellow taxi data (trip_type=1, ehail_fee=0). Adds a service_type discriminator column ('Green' or 'Yellow').
  • int_trips (21 output columns): Generates a surrogate trip_id from business keys using dbt_utils.generate_surrogate_key, joins with the payment_type_lookup seed for enrichment, and deduplicates using QUALIFY ROW_NUMBER().

Usage

These models are consumed by the marts layer via Template:Ref('int trips'). They should be rebuilt whenever staging data changes. Run with dbt run --select intermediate.

Code Reference

Source Location

  • 04-analytics-engineering/taxi_rides_ny/models/intermediate/int_trips_unioned.sql (Lines 1-56)
  • 04-analytics-engineering/taxi_rides_ny/models/intermediate/int_trips.sql (Lines 1-62)
  • 04-analytics-engineering/taxi_rides_ny/models/intermediate/schema.yml (Lines 1-107)

Signature: int_trips_unioned

-- Union green and yellow taxi data into a single dataset
-- Demonstrates how to combine data from multiple sources with slightly different schemas

with green_trips as (
    select
        vendor_id,
        rate_code_id,
        pickup_location_id,
        dropoff_location_id,
        pickup_datetime,
        dropoff_datetime,
        store_and_fwd_flag,
        passenger_count,
        trip_distance,
        trip_type,
        fare_amount,
        extra,
        mta_tax,
        tip_amount,
        tolls_amount,
        ehail_fee,
        improvement_surcharge,
        total_amount,
        payment_type,
        'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
),

yellow_trips as (
    select
        vendor_id,
        rate_code_id,
        pickup_location_id,
        dropoff_location_id,
        pickup_datetime,
        dropoff_datetime,
        store_and_fwd_flag,
        passenger_count,
        trip_distance,
        cast(1 as integer) as trip_type,  -- Yellow taxis only do street-hail (code 1)
        fare_amount,
        extra,
        mta_tax,
        tip_amount,
        tolls_amount,
        cast(0 as numeric) as ehail_fee,  -- Yellow taxis don't have ehail_fee
        improvement_surcharge,
        total_amount,
        payment_type,
        'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
)

select * from green_trips
union all
select * from yellow_trips

Signature: int_trips

-- Enrich and deduplicate trip data
-- Demonstrates enrichment and surrogate key generation

with unioned as (
    select * from {{ ref('int_trips_unioned') }}
),

payment_types as (
    select * from {{ ref('payment_type_lookup') }}
),

cleaned_and_enriched as (
    select
        -- Generate unique trip identifier (surrogate key pattern)
        {{ dbt_utils.generate_surrogate_key(
            ['u.vendor_id', 'u.pickup_datetime', 'u.pickup_location_id', 'u.service_type']
        ) }} as trip_id,

        -- Identifiers
        u.vendor_id,
        u.service_type,
        u.rate_code_id,

        -- Location IDs
        u.pickup_location_id,
        u.dropoff_location_id,

        -- Timestamps
        u.pickup_datetime,
        u.dropoff_datetime,

        -- Trip details
        u.store_and_fwd_flag,
        u.passenger_count,
        u.trip_distance,
        u.trip_type,

        -- Payment breakdown
        u.fare_amount,
        u.extra,
        u.mta_tax,
        u.tip_amount,
        u.tolls_amount,
        u.ehail_fee,
        u.improvement_surcharge,
        u.total_amount,

        -- Enrich with payment type description
        coalesce(u.payment_type, 0) as payment_type,
        coalesce(pt.description, 'Unknown') as payment_type_description

    from unioned u
    left join payment_types pt
        on coalesce(u.payment_type, 0) = pt.payment_type
)

select * from cleaned_and_enriched

-- Deduplicate: if multiple trips match (same vendor, second, location, service), keep first
qualify row_number() over(
    partition by vendor_id, pickup_datetime, pickup_location_id, service_type
    order by dropoff_datetime
) = 1

Import

External dependency required for surrogate key generation:

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.3.0", "<2.0.0"]
dbt deps

Materialization is inherited from dbt_project.yml:

models:
  taxi_rides_ny:
    intermediate:
      +materialized: table

I/O Contract

Inputs

Input Type Description
Template:Ref('stg green tripdata') Staging view Standardized green taxi trips (20 columns)
Template:Ref('stg yellow tripdata') Staging view Standardized yellow taxi trips (18 columns)
Template:Ref('payment type lookup') Seed table Payment type codes mapped to descriptions (payment_type, description)
dbt_utils.generate_surrogate_key Macro (dbt_utils) Generates deterministic MD5 hash from input columns

Outputs

Output Type Columns Description
int_trips_unioned Table (20 columns) vendor_id, rate_code_id, pickup_location_id, dropoff_location_id, pickup_datetime, dropoff_datetime, store_and_fwd_flag, passenger_count, trip_distance, trip_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, service_type Unified green and yellow taxi trips with harmonized schema
int_trips Table (21 columns) trip_id, vendor_id, service_type, rate_code_id, pickup_location_id, dropoff_location_id, pickup_datetime, dropoff_datetime, store_and_fwd_flag, passenger_count, trip_distance, trip_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, payment_type_description Deduplicated, enriched trip data with surrogate key and payment descriptions

Usage Examples

Building the intermediate layer

# Build only intermediate models
dbt run --select intermediate

# Build intermediate and all upstream dependencies
dbt run --select +intermediate

# Test intermediate models (checks unique, not_null, accepted_values on int_trips)
dbt test --select intermediate

Schema harmonization pattern for UNION ALL

-- Yellow taxis lack trip_type and ehail_fee columns.
-- The intermediate layer fills these with sensible defaults:
yellow_trips as (
    select
        vendor_id,
        -- ... shared columns ...
        cast(1 as integer) as trip_type,  -- Yellow taxis only do street-hail (code 1)
        -- ... shared columns ...
        cast(0 as numeric) as ehail_fee,  -- Yellow taxis don't have ehail_fee
        -- ... remaining columns ...
        'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
)

Surrogate key and deduplication pattern

-- Generate a deterministic surrogate key from business columns:
{{ dbt_utils.generate_surrogate_key(
    ['u.vendor_id', 'u.pickup_datetime', 'u.pickup_location_id', 'u.service_type']
) }} as trip_id

-- Then deduplicate using the same partition columns:
qualify row_number() over(
    partition by vendor_id, pickup_datetime, pickup_location_id, service_type
    order by dropoff_datetime
) = 1

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment