Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Implementation:DataTalksClub Data engineering zoomcamp Dbt Staging Models

From Leeroopedia


Page Metadata
Knowledge Sources repo: DataTalksClub/data-engineering-zoomcamp, dbt docs: dbt staging best practices
Domains Analytics Engineering, Staging Models, Data Standardization
Last Updated 2026-02-09 14:00 GMT

Overview

Concrete dbt models implementing the staging layer for NYC taxi trip data, standardizing raw green and yellow taxi records into a clean, consistently-named interface materialized as views.

Description

The staging layer consists of two models that each map one-to-one with a raw source table:

  • stg_green_tripdata (20 output columns): Standardizes raw green taxi data from source('raw', 'green_tripdata'). Renames green-taxi-specific prefixes (e.g., lpep_pickup_datetime to pickup_datetime) and includes green-only fields like trip_type and ehail_fee.
  • stg_yellow_tripdata (18 output columns): Standardizes raw yellow taxi data from source('raw', 'yellow_tripdata'). Renames yellow-taxi-specific prefixes (e.g., tpep_pickup_datetime to pickup_datetime). Does not include trip_type or ehail_fee (not present in yellow taxi data).

Both models follow the same CTE pattern (source, renamed, select), filter out null vendorid records, and apply a dev-environment date filter using project variables.

Usage

These models are the entry point for all downstream transformations. They are referenced by intermediate models via Template:Ref('stg green tripdata') and Template:Ref('stg yellow tripdata').

Code Reference

Source Location

  • 04-analytics-engineering/taxi_rides_ny/models/staging/stg_green_tripdata.sql (Lines 1-43)
  • 04-analytics-engineering/taxi_rides_ny/models/staging/stg_yellow_tripdata.sql (Lines 1-42)
  • 04-analytics-engineering/taxi_rides_ny/models/staging/schema.yml (Lines 1-95)

Signature: stg_green_tripdata

with source as (
    select * from {{ source('raw', 'green_tripdata') }}
),

renamed as (
    select
        -- identifiers
        cast(vendorid as integer) as vendor_id,
        {{ safe_cast('ratecodeid', 'integer') }} as rate_code_id,
        cast(pulocationid as integer) as pickup_location_id,
        cast(dolocationid as integer) as dropoff_location_id,

        -- timestamps
        cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
        cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,

        -- trip info
        cast(store_and_fwd_flag as string) as store_and_fwd_flag,
        cast(passenger_count as integer) as passenger_count,
        cast(trip_distance as numeric) as trip_distance,
        {{ safe_cast('trip_type', 'integer') }} as trip_type,

        -- payment info
        cast(fare_amount as numeric) as fare_amount,
        cast(extra as numeric) as extra,
        cast(mta_tax as numeric) as mta_tax,
        cast(tip_amount as numeric) as tip_amount,
        cast(tolls_amount as numeric) as tolls_amount,
        cast(ehail_fee as numeric) as ehail_fee,
        cast(improvement_surcharge as numeric) as improvement_surcharge,
        cast(total_amount as numeric) as total_amount,
        {{ safe_cast('payment_type', 'integer') }} as payment_type
    from source
    -- Filter out records with null vendor_id (data quality requirement)
    where vendorid is not null
)

select * from renamed

-- Sample records for dev environment using deterministic date filter
{% if target.name == 'dev' %}
where pickup_datetime >= '2019-01-01' and pickup_datetime < '2019-02-01'
{% endif %}

Signature: stg_yellow_tripdata

with source as (
    select * from {{ source('raw', 'yellow_tripdata') }}
),

renamed as (
    select
        -- identifiers (standardized naming for consistency across yellow/green)
        cast(vendorid as integer) as vendor_id,
        cast(ratecodeid as integer) as rate_code_id,
        cast(pulocationid as integer) as pickup_location_id,
        cast(dolocationid as integer) as dropoff_location_id,

        -- timestamps (standardized naming)
        cast(tpep_pickup_datetime as timestamp) as pickup_datetime,
        cast(tpep_dropoff_datetime as timestamp) as dropoff_datetime,

        -- trip info
        cast(store_and_fwd_flag as string) as store_and_fwd_flag,
        cast(passenger_count as integer) as passenger_count,
        cast(trip_distance as numeric) as trip_distance,

        -- payment info
        cast(fare_amount as numeric) as fare_amount,
        cast(extra as numeric) as extra,
        cast(mta_tax as numeric) as mta_tax,
        cast(tip_amount as numeric) as tip_amount,
        cast(tolls_amount as numeric) as tolls_amount,
        cast(improvement_surcharge as numeric) as improvement_surcharge,
        cast(total_amount as numeric) as total_amount,
        cast(payment_type as integer) as payment_type
    from source
    -- Filter out records with null vendor_id (data quality requirement)
    where vendorid is not null
)

select * from renamed

-- Sample records for dev environment using deterministic date filter
{% if target.name == 'dev' %}
where pickup_datetime >= '2019-01-01' and pickup_datetime < '2019-02-01'
{% endif %}

Import

Both staging models rely on the safe_cast macro and the source declaration:

-- safe_cast macro (macros/safe_cast.sql)
{% macro safe_cast(column, data_type) %}
    {% if target.type == 'bigquery' %}
        safe_cast({{ column }} as {{ data_type }})
    {% else %}
        cast({{ column }} as {{ data_type }})
    {% endif %}
{% endmacro %}

Materialization is inherited from dbt_project.yml:

models:
  taxi_rides_ny:
    staging:
      +materialized: view

I/O Contract

Inputs

Input Type Description
source('raw', 'green_tripdata') Source table Raw green taxi trip records with 18 columns (lpep-prefixed timestamps)
source('raw', 'yellow_tripdata') Source table Raw yellow taxi trip records with 16 columns (tpep-prefixed timestamps)
target.name Runtime variable Environment name ('dev' or 'prod') controlling date filter
target.type Runtime variable Adapter type ('bigquery' or 'duckdb') controlling safe_cast behavior

Outputs

Output Type Columns Description
stg_green_tripdata View (20 columns) vendor_id, rate_code_id, pickup_location_id, dropoff_location_id, pickup_datetime, dropoff_datetime, store_and_fwd_flag, passenger_count, trip_distance, trip_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type Standardized green taxi trips with consistent column names and types
stg_yellow_tripdata View (18 columns) vendor_id, rate_code_id, pickup_location_id, dropoff_location_id, pickup_datetime, dropoff_datetime, store_and_fwd_flag, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, payment_type Standardized yellow taxi trips (no trip_type or ehail_fee)

Usage Examples

Referencing a staging model from an intermediate model

-- In int_trips_unioned.sql
with green_trips as (
    select
        vendor_id,
        pickup_location_id,
        pickup_datetime,
        -- ... additional columns
        'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
),

yellow_trips as (
    select
        vendor_id,
        pickup_location_id,
        pickup_datetime,
        -- ... additional columns
        'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
)

select * from green_trips
union all
select * from yellow_trips

Running staging models only

# Build only the staging layer
dbt run --select staging

# Build a specific staging model
dbt run --select stg_green_tripdata

# Test staging models
dbt test --select staging

Column name standardization mapping

RAW (green)                 | RAW (yellow)                | STAGING (standardized)
----------------------------|-----------------------------|-----------------------
vendorid                    | vendorid                    | vendor_id
ratecodeid                  | ratecodeid                  | rate_code_id
pulocationid                | pulocationid                | pickup_location_id
dolocationid                | dolocationid                | dropoff_location_id
lpep_pickup_datetime        | tpep_pickup_datetime        | pickup_datetime
lpep_dropoff_datetime       | tpep_dropoff_datetime       | dropoff_datetime
store_and_fwd_flag          | store_and_fwd_flag          | store_and_fwd_flag
passenger_count             | passenger_count             | passenger_count
trip_distance               | trip_distance               | trip_distance
trip_type                   | (not present)               | trip_type
fare_amount                 | fare_amount                 | fare_amount
ehail_fee                   | (not present)               | ehail_fee
payment_type                | payment_type                | payment_type

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment