Implementation:DataTalksClub Data engineering zoomcamp Dbt Sources Yml

Page Metadata
Knowledge Sources	repo: DataTalksClub/data-engineering-zoomcamp, dbt docs: dbt sources reference
Domains	Analytics Engineering, Source Management, Multi-Environment Configuration
Last Updated	2026-02-09 14:00 GMT

Overview

Concrete configuration pattern for declaring raw data sources in a dbt project, enabling multi-adapter deployments (BigQuery and DuckDB) with freshness monitoring and column-level documentation.

Description

The sources.yml file in the taxi_rides_ny project defines a single source named raw that abstracts two raw taxi trip tables (green_tripdata and yellow_tripdata). The source definition uses Jinja conditionals to resolve the physical database and schema based on the active dbt target type:

BigQuery target: Resolves to {{ env_var('GCP_PROJECT_ID') }}.nytaxi.
DuckDB target: Resolves to taxi_rides_ny.prod.

Freshness thresholds warn after 24 hours and error after 48 hours without new data, with loaded_at_field set to each table's pickup datetime column (lpep_pickup_datetime for green, tpep_pickup_datetime for yellow). Every column is documented with a human-readable description: 18 for green_tripdata and 16 for yellow_tripdata.
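The threshold logic can be sketched in Python to make the pass/warn/error boundaries concrete. This is a minimal illustration under stated assumptions, not dbt's implementation: freshness_status is a hypothetical helper, and it assumes the age of a source is the current time minus the newest loaded_at_field value, with a status change only once the age strictly exceeds a threshold.

```python
from datetime import datetime, timedelta, timezone

# Thresholds copied from sources.yml: warn after 24 hours, error after 48 hours.
WARN_AFTER = timedelta(hours=24)
ERROR_AFTER = timedelta(hours=48)


def freshness_status(max_loaded_at: datetime, now: datetime) -> str:
    """Classify a source as 'pass', 'warn', or 'error'.

    Hypothetical helper for illustration; it is not part of dbt.
    max_loaded_at stands in for the newest pickup datetime in the table.
    """
    age = now - max_loaded_at
    if age > ERROR_AFTER:
        return "error"
    if age > WARN_AFTER:
        return "warn"
    return "pass"


# Fixed reference time so the example is reproducible.
now = datetime(2026, 2, 9, 14, 0, tzinfo=timezone.utc)
print(freshness_status(now - timedelta(hours=6), now))   # pass
print(freshness_status(now - timedelta(hours=30), now))  # warn
print(freshness_status(now - timedelta(hours=72), now))  # error
```

Because loaded_at_field points at the pickup datetime, the check measures how recent the newest trip is rather than when the load job ran.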

Usage

This source declaration is used when:

Staging models need to reference raw taxi data via {{ source('raw', 'green_tripdata') }} or {{ source('raw', 'yellow_tripdata') }}.
Running dbt source freshness to verify upstream data pipelines have loaded recent data.
Generating documentation that includes raw table schemas.
Switching between BigQuery (cloud) and DuckDB (local) environments without changing model SQL.

Code Reference

Source Location

04-analytics-engineering/taxi_rides_ny/models/staging/sources.yml (Lines 1-100)

Signature

sources:
  - name: raw
    description: Raw taxi trip data from NYC TLC
    database: |
      {%- if target.type == 'bigquery' -%}
        {{ env_var('GCP_PROJECT_ID', 'please-add-your-gcp-project-id-here') }}
      {%- else -%}
        taxi_rides_ny
      {%- endif -%}
    schema: |
      {%- if target.type == 'bigquery' -%}
        nytaxi
      {%- else -%}
        prod
      {%- endif -%}
    freshness:
      warn_after: {count: 24, period: hour}
      error_after: {count: 48, period: hour}
    tables:
      - name: green_tripdata
        description: Raw green taxi trip records
        loaded_at_field: lpep_pickup_datetime
        columns:
          - name: vendorid
            description: "Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.)"
          - name: lpep_pickup_datetime
            description: Date and time when the meter was engaged
          - name: lpep_dropoff_datetime
            description: Date and time when the meter was disengaged
          # ... (18 columns total)

      - name: yellow_tripdata
        description: Raw yellow taxi trip records
        loaded_at_field: tpep_pickup_datetime
        columns:
          - name: vendorid
            description: "Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.)"
          - name: tpep_pickup_datetime
            description: Date and time when the meter was engaged
          - name: tpep_dropoff_datetime
            description: Date and time when the meter was disengaged
          # ... (16 columns total)

Import

dbt discovers sources automatically when the YAML file sits under a directory listed in model-paths (models/ by default). To reference a source in a model:

{{ source('raw', 'green_tripdata') }}
{{ source('raw', 'yellow_tripdata') }}

To check freshness from the command line:

dbt source freshness

I/O Contract

Inputs

Input	Type	Description
`GCP_PROJECT_ID` environment variable	String	GCP project ID, read when `target.type == 'bigquery'`; if unset, falls back to the placeholder `'please-add-your-gcp-project-id-here'`, which is not a usable project ID
`target.type`	String	dbt adapter type (`'bigquery'` or `'duckdb'`), set in `profiles.yml`
`green_tripdata` raw table	Database table	Raw green taxi trip records with 18 columns including `lpep_pickup_datetime`
`yellow_tripdata` raw table	Database table	Raw yellow taxi trip records with 16 columns including `tpep_pickup_datetime`
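Since the env_var default is only a placeholder, a small pre-flight check can catch a missing GCP_PROJECT_ID before dbt runs against BigQuery. The sketch below is an assumption-laden illustration: gcp_project_id_problem is a hypothetical helper that is not part of the project or of dbt, and the target type is passed in by the caller rather than read from profiles.yml.

```python
import os

# Placeholder default taken from the env_var() call in sources.yml.
PLACEHOLDER = "please-add-your-gcp-project-id-here"


def gcp_project_id_problem(target_type, env):
    """Return a message if GCP_PROJECT_ID is unusable for this target, else None.

    Hypothetical helper for illustration only.
    """
    if target_type != "bigquery":
        # Non-BigQuery targets never read the variable.
        return None
    value = env.get("GCP_PROJECT_ID", "")
    if not value or value == PLACEHOLDER:
        return "GCP_PROJECT_ID is unset or still the placeholder value"
    return None


# Check the current process environment as if the active target were BigQuery.
problem = gcp_project_id_problem("bigquery", os.environ)
print(problem or "GCP_PROJECT_ID looks usable")
```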

Outputs

Output	Type	Description
`source('raw', 'green_tripdata')`	Source reference	Resolves to the physical green taxi table based on target environment
`source('raw', 'yellow_tripdata')`	Source reference	Resolves to the physical yellow taxi table based on target environment
Freshness status	Pass/Warn/Error	Result of `dbt source freshness` based on 24h warn / 48h error thresholds
Column documentation	Metadata	Human-readable descriptions for all source columns in generated docs

Usage Examples

Referencing a source in a staging model

-- stg_green_tripdata.sql
with source as (
    select * from {{ source('raw', 'green_tripdata') }}
),

renamed as (
    select
        cast(vendorid as integer) as vendor_id,
        cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
        -- ... additional column standardization
    from source
    where vendorid is not null
)

select * from renamed

Running freshness checks

# Check if raw data was loaded within expected thresholds
dbt source freshness

# Output example:
# 14:32:01  1 of 2 WARN freshness of raw.green_tripdata ........... [WARN in 0.45s]
# 14:32:02  2 of 2 PASS freshness of raw.yellow_tripdata .......... [PASS in 0.38s]

Multi-adapter resolution

# When target.type == 'bigquery':
#   database = "my-gcp-project"
#   schema   = "nytaxi"
#   Resolved: my-gcp-project.nytaxi.green_tripdata

# When target.type == 'duckdb':
#   database = "taxi_rides_ny"
#   schema   = "prod"
#   Resolved: taxi_rides_ny.prod.green_tripdata
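
The same branching can be mirrored in plain Python to see which relation name each target yields. This is a rough sketch rather than how dbt renders sources: resolve_source is a hypothetical function, the environment is passed in as a dict, and adapter-specific quoting is ignored.

```python
def resolve_source(target_type: str, table: str, env: dict) -> str:
    """Return database.schema.table as the Jinja conditionals would choose it.

    Hypothetical helper; real dbt rendering also applies adapter quoting.
    """
    if target_type == "bigquery":
        # Mirrors env_var('GCP_PROJECT_ID', '<placeholder>') in sources.yml.
        database = env.get("GCP_PROJECT_ID", "please-add-your-gcp-project-id-here")
        schema = "nytaxi"
    else:
        # The else branch covers every non-BigQuery target, not only DuckDB.
        database = "taxi_rides_ny"
        schema = "prod"
    return f"{database}.{schema}.{table}"


print(resolve_source("bigquery", "green_tripdata", {"GCP_PROJECT_ID": "my-gcp-project"}))
# my-gcp-project.nytaxi.green_tripdata
print(resolve_source("duckdb", "green_tripdata", {}))
# taxi_rides_ny.prod.green_tripdata
```

Note that the conditionals test only for 'bigquery', so any other adapter type would also resolve to taxi_rides_ny.prod.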
