Principle:DataTalksClub Data engineering zoomcamp Dbt Source Declaration

From Leeroopedia


Page Metadata
Knowledge Sources dbt sources documentation, analytics engineering best practices
Domains Analytics Engineering, Data Abstraction, Source Management
Last Updated 2026-02-09 14:00 GMT

Overview

Abstracting raw data references through source declarations decouples transformation logic from physical database locations, enabling portable, multi-environment deployments with built-in freshness monitoring.

Description

The principle of source declaration holds that transformation models should never reference raw tables by their physical database path. Instead, a declarative source definition maps logical names to physical locations, providing several critical benefits:

  • Location abstraction: Models reference source('raw', 'green_tripdata') rather than project.dataset.green_tripdata. If the physical location changes, only the source definition needs updating.
  • Multi-environment portability: Conditional logic in source definitions allows the same models to run against different databases or schemas depending on the target environment (e.g., BigQuery in production, DuckDB in development).
  • Freshness monitoring: Source definitions include freshness thresholds that alert when raw data has not been updated within expected intervals, catching pipeline failures before they propagate to analytics.
  • Documentation at the boundary: Column-level descriptions on source tables document the raw data contract, making explicit what the transformation layer expects from upstream systems.

This principle enforces a clean boundary between the ingestion layer (raw data loading) and the transformation layer (dbt models), following the broader architectural pattern of dependency inversion in data systems.
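A minimal source declaration capturing these ideas might look like the following (a sketch; the file path, project ID variable, and table names follow the zoomcamp example and should be treated as illustrative):

```yaml
# models/staging/schema.yml (illustrative)
version: 2

sources:
  - name: raw
    # Physical location is defined once, here, rather than in every model.
    database: "{{ env_var('GCP_PROJECT_ID', 'dev_project') }}"
    schema: nytaxi
    tables:
      - name: green_tripdata
      - name: yellow_tripdata
```

A staging model then selects `from {{ source('raw', 'green_tripdata') }}` instead of hard-coding the physical path; if the dataset moves, only this file changes.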

Usage

Use source declarations when:

  • Raw data tables are loaded by an external process (e.g., an ingestion pipeline, Fivetran, Airbyte).
  • The same transformation logic must run against different databases or schemas (e.g., BigQuery for production, DuckDB for local development).
  • Data freshness must be monitored as part of pipeline health.
  • Column-level documentation is needed for raw tables that the transformation layer consumes.
  • Multiple staging models reference the same raw tables, and a single point of definition is desired.

Theoretical Basis

Dependency Inversion in Data Pipelines

In traditional software engineering, the Dependency Inversion Principle states that high-level modules should not depend on low-level modules; both should depend on abstractions. Source declarations apply this principle to data pipelines:

WITHOUT SOURCE DECLARATIONS:
  staging_model --> physical_table("project.dataset.green_tripdata")
  (tight coupling: model breaks if table moves)

WITH SOURCE DECLARATIONS:
  staging_model --> source('raw', 'green_tripdata') --> source_definition --> physical_table
  (loose coupling: only source definition needs updating)

Environment-Aware Resolution

Source declarations support conditional logic that resolves physical locations at compile time:

function resolve_source(source_name, table_name, target):
    source_def = load_source_definition(source_name)

    if target.type == "bigquery":
        database = env_var("GCP_PROJECT_ID")
        schema = "nytaxi"
    else:  # duckdb, postgres, etc.
        database = source_def.default_database
        schema = "prod"

    return database + "." + schema + "." + table_name

This allows a single set of models to operate transparently across cloud warehouses and local development databases.
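One common way to approximate this resolution in dbt itself (a sketch; the variable names and defaults are assumptions, and projects vary in how they wire this up) is to template the source definition with environment variables that carry production values and fall back to local defaults:

```yaml
# models/staging/schema.yml (illustrative)
version: 2

sources:
  - name: raw
    # In production, DBT_DATABASE holds the GCP project ID; when unset,
    # the declaration falls back to a local development database name.
    database: "{{ env_var('DBT_DATABASE', 'main') }}"
    schema: "{{ env_var('DBT_SCHEMA', 'prod') }}"
    tables:
      - name: green_tripdata
```

The same models then compile against BigQuery in CI/production and against a local warehouse in development, with no changes to model SQL.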

Freshness Monitoring

Source freshness checks compare the most recent value in a loaded_at_field against the current time:

function check_freshness(source_table, freshness_config):
    max_loaded_at = query("SELECT MAX({loaded_at_field}) FROM {source_table}")
    staleness = current_time() - max_loaded_at

    if staleness > freshness_config.error_after:
        return ERROR("Source data is stale beyond error threshold")
    elif staleness > freshness_config.warn_after:
        return WARN("Source data is stale beyond warning threshold")
    else:
        return PASS

This pattern turns raw data availability into a testable contract, allowing teams to detect upstream failures proactively.
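In dbt, this check is configured declaratively on the source and executed with `dbt source freshness`. A sketch of such a configuration (the `loaded_at` column name and the thresholds are illustrative assumptions):

```yaml
# models/staging/schema.yml (illustrative)
version: 2

sources:
  - name: raw
    # Column that records when each row was loaded by the ingestion layer.
    loaded_at_field: loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: green_tripdata
```

With this in place, a stalled ingestion pipeline surfaces as a freshness warning or error rather than as silently outdated dashboards.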

Column Documentation as Contract

Documenting columns at the source level creates a data contract between the ingestion layer and the transformation layer. When a column's type or semantics change upstream, the source documentation serves as the reference for what was originally expected.
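Such a contract is expressed as column descriptions directly on the source table. A brief sketch (descriptions paraphrased from the NYC taxi data dictionary; treat the exact wording as illustrative):

```yaml
# models/staging/schema.yml (illustrative)
version: 2

sources:
  - name: raw
    tables:
      - name: green_tripdata
        description: "Raw green taxi trip records loaded by the ingestion pipeline."
        columns:
          - name: vendor_id
            description: "Code of the provider that supplied the trip record."
          - name: lpep_pickup_datetime
            description: "Timestamp when the meter was engaged."
```

These descriptions flow into dbt's generated documentation site, so the raw-data contract is visible alongside the transformation lineage.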
