Implementation:DataTalksClub Data engineering zoomcamp Dbt Sources Yml

From Leeroopedia


Page Metadata
Knowledge Sources: DataTalksClub/data-engineering-zoomcamp (repo); dbt sources reference (dbt docs)
Domains: Analytics Engineering, Source Management, Multi-Environment Configuration
Last Updated: 2026-02-09 14:00 GMT

Overview

Concrete configuration pattern for declaring raw data sources in a dbt project, enabling multi-adapter deployments (BigQuery and DuckDB) with freshness monitoring and column-level documentation.

Description

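The sources.yml file in the taxi_rides_ny project defines a single source named raw that abstracts two raw taxi trip tables (green_tripdata and yellow_tripdata). The source definition uses Jinja conditionals to resolve the physical database and schema based on the active dbt target type:

- BigQuery (target.type == 'bigquery'): database is read from the GCP_PROJECT_ID environment variable and schema is nytaxi.
- Any other target (DuckDB in this project): database is taxi_rides_ny and schema is prod.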

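Freshness thresholds are set once at the source level: dbt warns when the newest record is more than 24 hours old and errors when it is more than 48 hours old. Each table sets loaded_at_field to its pickup datetime column (lpep_pickup_datetime for green_tripdata, tpep_pickup_datetime for yellow_tripdata). Every column in both tables carries a human-readable description; the excerpt below notes 18 columns for green_tripdata and 16 for yellow_tripdata.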
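The threshold behavior described above can be sketched in a few lines of Python. This is an illustration only, not dbt's implementation: the function name freshness_status is hypothetical, and the strict "greater than" comparison at the boundary is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Thresholds copied from the freshness block in sources.yml.
WARN_AFTER = timedelta(hours=24)
ERROR_AFTER = timedelta(hours=48)


def freshness_status(max_loaded_at: datetime, now: datetime) -> str:
    """Classify a table by the age of its newest loaded_at_field value.

    Hypothetical helper for illustration; dbt computes this internally.
    """
    age = now - max_loaded_at
    if age > ERROR_AFTER:  # assumption: strict comparison at the boundary
        return "error"
    if age > WARN_AFTER:
        return "warn"
    return "pass"


# Example timestamps are made up for the demonstration.
now = datetime(2026, 2, 9, 14, 0, tzinfo=timezone.utc)
for hours_old in (3, 30, 72):
    newest = now - timedelta(hours=hours_old)
    print(hours_old, freshness_status(newest, now))
```

Because loaded_at_field points at the pickup datetime, the age being measured is that of the most recent trip in the table rather than the time the load job ran.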

Usage

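This source declaration is used when:

- a staging model reads a raw trip table through source() instead of a hard-coded table name;
- raw data freshness is checked with dbt source freshness;
- the same project runs against both BigQuery and DuckDB targets.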

Code Reference

Source Location

04-analytics-engineering/taxi_rides_ny/models/staging/sources.yml (Lines 1-100)

Signature

sources:
  - name: raw
    description: Raw taxi trip data from NYC TLC
    database: |
      {%- if target.type == 'bigquery' -%}
        {{ env_var('GCP_PROJECT_ID', 'please-add-your-gcp-project-id-here') }}
      {%- else -%}
        taxi_rides_ny
      {%- endif -%}
    schema: |
      {%- if target.type == 'bigquery' -%}
        nytaxi
      {%- else -%}
        prod
      {%- endif -%}
    freshness:
      warn_after: {count: 24, period: hour}
      error_after: {count: 48, period: hour}
    tables:
      - name: green_tripdata
        description: Raw green taxi trip records
        loaded_at_field: lpep_pickup_datetime
        columns:
          - name: vendorid
            description: "Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.)"
          - name: lpep_pickup_datetime
            description: Date and time when the meter was engaged
          - name: lpep_dropoff_datetime
            description: Date and time when the meter was disengaged
          # ... (18 columns total)

      - name: yellow_tripdata
        description: Raw yellow taxi trip records
        loaded_at_field: tpep_pickup_datetime
        columns:
          - name: vendorid
            description: "Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.)"
          - name: tpep_pickup_datetime
            description: Date and time when the meter was engaged
          - name: tpep_dropoff_datetime
            description: Date and time when the meter was disengaged
          # ... (16 columns total)

Import

dbt discovers source definitions automatically when the YAML file sits in a directory listed under model-paths in dbt_project.yml; no explicit import is needed. To reference a source in a model:

{{ source('raw', 'green_tripdata') }}
{{ source('raw', 'yellow_tripdata') }}

To check freshness from the command line:

dbt source freshness

I/O Contract

Inputs

Input | Type | Description
GCP_PROJECT_ID environment variable | String | GCP project ID, read only when target.type == 'bigquery'; falls back to the placeholder 'please-add-your-gcp-project-id-here' when unset
target.type | String | dbt adapter type ('bigquery' or 'duckdb'), set in profiles.yml
green_tripdata raw table | Database table | Raw green taxi trip records with 18 columns, including lpep_pickup_datetime
yellow_tripdata raw table | Database table | Raw yellow taxi trip records with 16 columns, including tpep_pickup_datetime

Outputs

Output | Type | Description
source('raw', 'green_tripdata') | Source reference | Resolves to the physical green taxi table based on target environment
source('raw', 'yellow_tripdata') | Source reference | Resolves to the physical yellow taxi table based on target environment
Freshness status | Pass/Warn/Error | Result of dbt source freshness based on 24h warn / 48h error thresholds
Column documentation | Metadata | Human-readable descriptions for all source columns in generated docs

Usage Examples

Referencing a source in a staging model

-- stg_green_tripdata.sql
with source as (
    select * from {{ source('raw', 'green_tripdata') }}
),

renamed as (
    select
        cast(vendorid as integer) as vendor_id,
        cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
        -- ... additional column standardization
    from source
    where vendorid is not null
)

select * from renamed

Running freshness checks

# Check if raw data was loaded within expected thresholds
dbt source freshness

# Output example:
# 14:32:01  1 of 2 WARN freshness of raw.green_tripdata ........... [WARN in 0.45s]
# 14:32:02  2 of 2 PASS freshness of raw.yellow_tripdata .......... [PASS in 0.38s]

Multi-adapter resolution

# When target.type == 'bigquery':
#   database = "my-gcp-project"
#   schema   = "nytaxi"
#   Resolved: my-gcp-project.nytaxi.green_tripdata

# When target.type == 'duckdb':
#   database = "taxi_rides_ny"
#   schema   = "prod"
#   Resolved: taxi_rides_ny.prod.green_tripdata
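The Jinja branches in sources.yml amount to a small lookup. The sketch below mirrors them in plain Python to make the resolution explicit; resolve_source and its env parameter are hypothetical names introduced for illustration and are not part of dbt.

```python
import os
from typing import Mapping, Optional

# Placeholder default taken from the env_var() call in sources.yml.
PLACEHOLDER_PROJECT = "please-add-your-gcp-project-id-here"


def resolve_source(
    target_type: str,
    table: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return database.schema.table the way the Jinja conditionals would.

    Hypothetical helper: dbt performs this rendering itself at parse time.
    """
    env = os.environ if env is None else env
    if target_type == "bigquery":
        database = env.get("GCP_PROJECT_ID", PLACEHOLDER_PROJECT)
        schema = "nytaxi"
    else:
        # The else branch covers every non-BigQuery target, not only DuckDB.
        database = "taxi_rides_ny"
        schema = "prod"
    return f"{database}.{schema}.{table}"


# "my-gcp-project" is an example value, as in the comments above.
print(resolve_source("bigquery", "green_tripdata", {"GCP_PROJECT_ID": "my-gcp-project"}))
print(resolve_source("duckdb", "green_tripdata", {}))
```

One consequence worth noting: if GCP_PROJECT_ID is unset on a BigQuery target, the placeholder string becomes the database name, so the failure surfaces later as a missing-project error rather than at parse time.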
