Implementation:DataTalksClub Data engineering zoomcamp Dbt Staging Models
| Page Metadata | |
|---|---|
| Knowledge Sources | repo: DataTalksClub/data-engineering-zoomcamp, dbt docs: dbt staging best practices |
| Domains | Analytics Engineering, Staging Models, Data Standardization |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete dbt models implementing the staging layer for NYC taxi trip data, standardizing raw green and yellow taxi records into a clean, consistently-named interface materialized as views.
Description
The staging layer consists of two models that each map one-to-one with a raw source table:
stg_green_tripdata(20 output columns): Standardizes raw green taxi data fromsource('raw', 'green_tripdata'). Renames green-taxi-specific prefixes (e.g.,lpep_pickup_datetimetopickup_datetime) and includes green-only fields liketrip_typeandehail_fee.stg_yellow_tripdata(18 output columns): Standardizes raw yellow taxi data fromsource('raw', 'yellow_tripdata'). Renames yellow-taxi-specific prefixes (e.g.,tpep_pickup_datetimetopickup_datetime). Does not includetrip_typeorehail_fee(not present in yellow taxi data).
Both models follow the same CTE pattern (source, renamed, select), filter out null vendorid records, and apply a dev-environment date filter using project variables.
Usage
These models are the entry point for all downstream transformations. They are referenced by intermediate models via Template:Ref('stg green tripdata') and Template:Ref('stg yellow tripdata').
Code Reference
Source Location
04-analytics-engineering/taxi_rides_ny/models/staging/stg_green_tripdata.sql(Lines 1-43)04-analytics-engineering/taxi_rides_ny/models/staging/stg_yellow_tripdata.sql(Lines 1-42)04-analytics-engineering/taxi_rides_ny/models/staging/schema.yml(Lines 1-95)
Signature: stg_green_tripdata
with source as (
select * from {{ source('raw', 'green_tripdata') }}
),
renamed as (
select
-- identifiers
cast(vendorid as integer) as vendor_id,
{{ safe_cast('ratecodeid', 'integer') }} as rate_code_id,
cast(pulocationid as integer) as pickup_location_id,
cast(dolocationid as integer) as dropoff_location_id,
-- timestamps
cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
-- trip info
cast(store_and_fwd_flag as string) as store_and_fwd_flag,
cast(passenger_count as integer) as passenger_count,
cast(trip_distance as numeric) as trip_distance,
{{ safe_cast('trip_type', 'integer') }} as trip_type,
-- payment info
cast(fare_amount as numeric) as fare_amount,
cast(extra as numeric) as extra,
cast(mta_tax as numeric) as mta_tax,
cast(tip_amount as numeric) as tip_amount,
cast(tolls_amount as numeric) as tolls_amount,
cast(ehail_fee as numeric) as ehail_fee,
cast(improvement_surcharge as numeric) as improvement_surcharge,
cast(total_amount as numeric) as total_amount,
{{ safe_cast('payment_type', 'integer') }} as payment_type
from source
-- Filter out records with null vendor_id (data quality requirement)
where vendorid is not null
)
select * from renamed
-- Sample records for dev environment using deterministic date filter
{% if target.name == 'dev' %}
where pickup_datetime >= '2019-01-01' and pickup_datetime < '2019-02-01'
{% endif %}
Signature: stg_yellow_tripdata
with source as (
select * from {{ source('raw', 'yellow_tripdata') }}
),
renamed as (
select
-- identifiers (standardized naming for consistency across yellow/green)
cast(vendorid as integer) as vendor_id,
cast(ratecodeid as integer) as rate_code_id,
cast(pulocationid as integer) as pickup_location_id,
cast(dolocationid as integer) as dropoff_location_id,
-- timestamps (standardized naming)
cast(tpep_pickup_datetime as timestamp) as pickup_datetime,
cast(tpep_dropoff_datetime as timestamp) as dropoff_datetime,
-- trip info
cast(store_and_fwd_flag as string) as store_and_fwd_flag,
cast(passenger_count as integer) as passenger_count,
cast(trip_distance as numeric) as trip_distance,
-- payment info
cast(fare_amount as numeric) as fare_amount,
cast(extra as numeric) as extra,
cast(mta_tax as numeric) as mta_tax,
cast(tip_amount as numeric) as tip_amount,
cast(tolls_amount as numeric) as tolls_amount,
cast(improvement_surcharge as numeric) as improvement_surcharge,
cast(total_amount as numeric) as total_amount,
cast(payment_type as integer) as payment_type
from source
-- Filter out records with null vendor_id (data quality requirement)
where vendorid is not null
)
select * from renamed
-- Sample records for dev environment using deterministic date filter
{% if target.name == 'dev' %}
where pickup_datetime >= '2019-01-01' and pickup_datetime < '2019-02-01'
{% endif %}
Import
Both staging models rely on the safe_cast macro and the source declaration:
-- safe_cast macro (macros/safe_cast.sql)
{% macro safe_cast(column, data_type) %}
{% if target.type == 'bigquery' %}
safe_cast({{ column }} as {{ data_type }})
{% else %}
cast({{ column }} as {{ data_type }})
{% endif %}
{% endmacro %}
Materialization is inherited from dbt_project.yml:
models:
taxi_rides_ny:
staging:
+materialized: view
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
source('raw', 'green_tripdata') |
Source table | Raw green taxi trip records with 18 columns (lpep-prefixed timestamps) |
source('raw', 'yellow_tripdata') |
Source table | Raw yellow taxi trip records with 16 columns (tpep-prefixed timestamps) |
target.name |
Runtime variable | Environment name ('dev' or 'prod') controlling date filter
|
target.type |
Runtime variable | Adapter type ('bigquery' or 'duckdb') controlling safe_cast behavior
|
Outputs
| Output | Type | Columns | Description |
|---|---|---|---|
stg_green_tripdata |
View (20 columns) | vendor_id, rate_code_id, pickup_location_id, dropoff_location_id, pickup_datetime, dropoff_datetime, store_and_fwd_flag, passenger_count, trip_distance, trip_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type | Standardized green taxi trips with consistent column names and types |
stg_yellow_tripdata |
View (18 columns) | vendor_id, rate_code_id, pickup_location_id, dropoff_location_id, pickup_datetime, dropoff_datetime, store_and_fwd_flag, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, payment_type | Standardized yellow taxi trips (no trip_type or ehail_fee) |
Usage Examples
Referencing a staging model from an intermediate model
-- In int_trips_unioned.sql
with green_trips as (
select
vendor_id,
pickup_location_id,
pickup_datetime,
-- ... additional columns
'Green' as service_type
from {{ ref('stg_green_tripdata') }}
),
yellow_trips as (
select
vendor_id,
pickup_location_id,
pickup_datetime,
-- ... additional columns
'Yellow' as service_type
from {{ ref('stg_yellow_tripdata') }}
)
select * from green_trips
union all
select * from yellow_trips
Running staging models only
# Build only the staging layer
dbt run --select staging
# Build a specific staging model
dbt run --select stg_green_tripdata
# Test staging models
dbt test --select staging
Column name standardization mapping
RAW (green) | RAW (yellow) | STAGING (standardized)
----------------------------|-----------------------------|-----------------------
vendorid | vendorid | vendor_id
ratecodeid | ratecodeid | rate_code_id
pulocationid | pulocationid | pickup_location_id
dolocationid | dolocationid | dropoff_location_id
lpep_pickup_datetime | tpep_pickup_datetime | pickup_datetime
lpep_dropoff_datetime | tpep_dropoff_datetime | dropoff_datetime
store_and_fwd_flag | store_and_fwd_flag | store_and_fwd_flag
passenger_count | passenger_count | passenger_count
trip_distance | trip_distance | trip_distance
trip_type | (not present) | trip_type
fare_amount | fare_amount | fare_amount
ehail_fee | (not present) | ehail_fee
payment_type | payment_type | payment_type
Related Pages
- Principle:DataTalksClub_Data_engineering_zoomcamp_Dbt_Staging_Layer
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Dbt_Sources_Yml
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Dbt_Intermediate_Models
- Heuristic:DataTalksClub_Data_engineering_zoomcamp_Dbt_Materialization_Strategy
- Environment:DataTalksClub_Data_engineering_zoomcamp_Dbt_DuckDB_Environment