Implementation:DataTalksClub Data engineering zoomcamp Dbt Sources Yml
| Page Metadata | |
|---|---|
| Knowledge Sources | repo: DataTalksClub/data-engineering-zoomcamp, dbt docs: dbt sources reference |
| Domains | Analytics Engineering, Source Management, Multi-Environment Configuration |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete configuration pattern for declaring raw data sources in a dbt project, enabling multi-adapter deployments (BigQuery and DuckDB) with freshness monitoring and column-level documentation.
Description
The sources.yml file in the taxi_rides_ny project defines a single source named raw that abstracts two raw taxi trip tables (green_tripdata and yellow_tripdata). The source definition uses Jinja conditionals to resolve the physical database and schema based on the active dbt target type:
- BigQuery target: Resolves to {{ env_var('GCP_PROJECT_ID') }}.nytaxi.
- DuckDB target (the else branch, so any non-BigQuery target): Resolves to taxi_rides_ny.prod.
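The target-dependent resolution can be mirrored in plain Python to make the branching explicit. This is a minimal sketch, not dbt's own rendering code: the function names resolve_source_location and qualified_name are hypothetical, and the env argument is an illustrative stand-in for the process environment that env_var() reads.

```python
import os


def resolve_source_location(target_type, env=None):
    """Return (database, schema) following the Jinja conditionals in sources.yml."""
    env = os.environ if env is None else env
    if target_type == "bigquery":
        # Same fallback string as the env_var() default in sources.yml.
        database = env.get("GCP_PROJECT_ID", "please-add-your-gcp-project-id-here")
        schema = "nytaxi"
    else:
        # Every non-BigQuery target takes the else branch (DuckDB in this project).
        database = "taxi_rides_ny"
        schema = "prod"
    return database, schema


def qualified_name(target_type, table, env=None):
    """Build the database.schema.table name a source() call would resolve to."""
    database, schema = resolve_source_location(target_type, env)
    return f"{database}.{schema}.{table}"


print(qualified_name("duckdb", "green_tripdata"))
print(qualified_name("bigquery", "green_tripdata", {"GCP_PROJECT_ID": "my-gcp-project"}))
```

Passing an empty mapping for a BigQuery target returns the placeholder project ID, which is a quick way to see what dbt would try to query when GCP_PROJECT_ID is unset.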
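Freshness thresholds are configured to warn after 24 hours and error after 48 hours without new data. loaded_at_field is set to each table's pickup datetime column (lpep_pickup_datetime for green_tripdata, tpep_pickup_datetime for yellow_tripdata), so freshness is measured from the most recent pickup timestamp rather than from an ingestion time. All columns of each table are documented with human-readable descriptions.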
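The threshold logic can be illustrated with a short Python sketch. It assumes that a source counts as stale only once the age of its newest loaded_at_field value strictly exceeds a threshold. The helper name freshness_status and the example timestamps are hypothetical, and this models the behaviour rather than reproducing dbt's implementation.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the freshness block in sources.yml.
WARN_AFTER = timedelta(hours=24)
ERROR_AFTER = timedelta(hours=48)


def freshness_status(max_loaded_at, now):
    """Classify a source as pass, warn, or error from the age of its newest row."""
    age = now - max_loaded_at
    if age > ERROR_AFTER:
        return "error"
    if age > WARN_AFTER:
        return "warn"
    return "pass"


# Illustrative timestamps only; real values come from max(loaded_at_field).
now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
for hours_old in (10, 30, 50):
    newest_pickup = now - timedelta(hours=hours_old)
    print(hours_old, freshness_status(newest_pickup, now))
```

Because the age is derived from pickup timestamps, a static historical dataset would land in the error branch every time, which is worth keeping in mind when interpreting results.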
Usage
This source declaration is used when:
- Staging models need to reference raw taxi data via {{ source('raw', 'green_tripdata') }} or {{ source('raw', 'yellow_tripdata') }}.
- Running dbt source freshness to verify upstream data pipelines have loaded recent data.
- Generating documentation that includes raw table schemas.
- Switching between BigQuery (cloud) and DuckDB (local) environments without changing model SQL.
Code Reference
Source Location
04-analytics-engineering/taxi_rides_ny/models/staging/sources.yml (Lines 1-100)
Signature
sources:
  - name: raw
    description: Raw taxi trip data from NYC TLC
    database: |
      {%- if target.type == 'bigquery' -%}
        {{ env_var('GCP_PROJECT_ID', 'please-add-your-gcp-project-id-here') }}
      {%- else -%}
        taxi_rides_ny
      {%- endif -%}
    schema: |
      {%- if target.type == 'bigquery' -%}
        nytaxi
      {%- else -%}
        prod
      {%- endif -%}
    freshness:
      warn_after: {count: 24, period: hour}
      error_after: {count: 48, period: hour}
    tables:
      - name: green_tripdata
        description: Raw green taxi trip records
        loaded_at_field: lpep_pickup_datetime
        columns:
          - name: vendorid
            description: "Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.)"
          - name: lpep_pickup_datetime
            description: Date and time when the meter was engaged
          - name: lpep_dropoff_datetime
            description: Date and time when the meter was disengaged
          # ... (18 columns total)
      - name: yellow_tripdata
        description: Raw yellow taxi trip records
        loaded_at_field: tpep_pickup_datetime
        columns:
          - name: vendorid
            description: "Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.)"
          - name: tpep_pickup_datetime
            description: Date and time when the meter was engaged
          - name: tpep_dropoff_datetime
            description: Date and time when the meter was disengaged
          # ... (16 columns total)
Import
Sources need no import statement: dbt discovers them automatically when the YAML file sits in a directory listed under model-paths in dbt_project.yml. To reference a source in a model:
{{ source('raw', 'green_tripdata') }}
{{ source('raw', 'yellow_tripdata') }}
To check freshness from the command line:
dbt source freshness
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| GCP_PROJECT_ID environment variable | String | GCP project ID, read when target.type == 'bigquery'; falls back to the placeholder 'please-add-your-gcp-project-id-here' if unset |
| target.type | String | dbt adapter type ('bigquery' or 'duckdb'), set in profiles.yml |
| green_tripdata raw table | Database table | Raw green taxi trip records with 18 columns including lpep_pickup_datetime |
| yellow_tripdata raw table | Database table | Raw yellow taxi trip records with 16 columns including tpep_pickup_datetime |
Outputs
| Output | Type | Description |
|---|---|---|
| source('raw', 'green_tripdata') | Source reference | Resolves to the physical green taxi table based on target environment |
| source('raw', 'yellow_tripdata') | Source reference | Resolves to the physical yellow taxi table based on target environment |
| Freshness status | Pass/Warn/Error | Result of dbt source freshness based on 24h warn / 48h error thresholds |
| Column documentation | Metadata | Human-readable descriptions for all source columns in generated docs |
Usage Examples
Referencing a source in a staging model
-- stg_green_tripdata.sql
with source as (
    select * from {{ source('raw', 'green_tripdata') }}
),
renamed as (
    select
        cast(vendorid as integer) as vendor_id,
        cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
        -- ... additional column standardization
    from source
    where vendorid is not null
)
select * from renamed
Running freshness checks
# Check if raw data was loaded within expected thresholds
dbt source freshness
# Output example:
# 14:32:01 1 of 2 WARN freshness of raw.green_tripdata ........... [WARN in 0.45s]
# 14:32:02 2 of 2 PASS freshness of raw.yellow_tripdata .......... [PASS in 0.38s]
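Freshness results can also be inspected programmatically. The sketch below assumes the run writes a JSON artifact (target/sources.json in a default dbt layout) whose results entries carry unique_id and status fields. Those field names and the unique_id format should be verified against the installed dbt version. The sample payload is hypothetical and simply mirrors the console output above, and the helper names are illustrative.

```python
import json


def summarize_freshness(artifact):
    """Map each source's unique_id to its reported freshness status."""
    return {r["unique_id"]: r["status"] for r in artifact.get("results", [])}


def stale_sources(summary):
    """Return the sorted unique_ids whose status is anything other than pass."""
    return sorted(uid for uid, status in summary.items() if status != "pass")


# Hypothetical payload shaped like the assumed artifact. In a real project,
# load it with json.load(open("target/sources.json")) instead.
sample = json.loads("""
{
  "results": [
    {"unique_id": "source.taxi_rides_ny.raw.green_tripdata", "status": "warn"},
    {"unique_id": "source.taxi_rides_ny.raw.yellow_tripdata", "status": "pass"}
  ]
}
""")

summary = summarize_freshness(sample)
print(stale_sources(summary))
```

A check like this could gate a downstream job on stale_sources returning an empty list, leaving the thresholds themselves defined only in sources.yml.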
Multi-adapter resolution
# When target.type == 'bigquery':
# database = "my-gcp-project"
# schema = "nytaxi"
# Resolved: my-gcp-project.nytaxi.green_tripdata
# When target.type == 'duckdb':
# database = "taxi_rides_ny"
# schema = "prod"
# Resolved: taxi_rides_ny.prod.green_tripdata