Principle: DataTalksClub Data Engineering Zoomcamp dbt Source Declaration
| Page Metadata | |
|---|---|
| Knowledge Sources | dbt sources documentation, analytics engineering best practices |
| Domains | Analytics Engineering, Data Abstraction, Source Management |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Abstracting raw data references through source declarations decouples transformation logic from physical database locations, enabling portable, multi-environment deployments with built-in freshness monitoring.
Description
The principle of source declaration holds that transformation models should never reference raw tables by their physical database path. Instead, a declarative source definition maps logical names to physical locations, providing several critical benefits:
- Location abstraction: Models reference `source('raw', 'green_tripdata')` rather than `project.dataset.green_tripdata`. If the physical location changes, only the source definition needs updating.
- Multi-environment portability: Conditional logic in source definitions allows the same models to run against different databases or schemas depending on the target environment (e.g., BigQuery in production, DuckDB in development).
- Freshness monitoring: Source definitions include `freshness` thresholds that alert when raw data has not been updated within expected intervals, catching pipeline failures before they propagate to analytics.
- Documentation at the boundary: Column-level descriptions on source tables document the raw data contract, making explicit what the transformation layer expects from upstream systems.
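In dbt, this mapping lives in a `sources` block of a schema YAML file. A minimal sketch (the `raw` source name and `green_tripdata` table come from the text above; the `nytaxi` schema is illustrative):

```yaml
version: 2

sources:
  - name: raw                  # logical name used by source('raw', ...)
    schema: nytaxi             # physical schema (illustrative)
    tables:
      - name: green_tripdata   # physical table name in that schema
```

A staging model then refers to the table as `{{ source('raw', 'green_tripdata') }}`; dbt resolves this to the physical path at compile time, so moving the raw table requires editing only this one file.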
This principle enforces a clean boundary between the ingestion layer (raw data loading) and the transformation layer (dbt models), following the broader architectural pattern of dependency inversion in data systems.
Usage
Use source declarations when:
- Raw data tables are loaded by an external process (e.g., an ingestion pipeline, Fivetran, Airbyte).
- The same transformation logic must run against different databases or schemas (e.g., BigQuery for production, DuckDB for local development).
- Data freshness must be monitored as part of pipeline health.
- Column-level documentation is needed for raw tables that the transformation layer consumes.
- Multiple staging models reference the same raw tables, and a single point of definition is desired.
Theoretical Basis
Dependency Inversion in Data Pipelines
In traditional software engineering, the Dependency Inversion Principle states that high-level modules should not depend on low-level modules; both should depend on abstractions. Source declarations apply this principle to data pipelines:
WITHOUT SOURCE DECLARATIONS:
staging_model --> physical_table("project.dataset.green_tripdata")
(tight coupling: model breaks if table moves)
WITH SOURCE DECLARATIONS:
staging_model --> source('raw', 'green_tripdata') --> source_definition --> physical_table
(loose coupling: only source definition needs updating)
Environment-Aware Resolution
Source declarations support conditional logic that resolves physical locations at compile time:
function resolve_source(source_name, table_name, target):
    source_def = load_source_definition(source_name)
    if target.type == "bigquery":
        database = env_var("GCP_PROJECT_ID")
        schema = "nytaxi"
    else:  # duckdb, postgres, etc.
        database = source_def.default_database
        schema = "prod"
    return "{database}.{schema}.{table_name}"
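In dbt itself, this compile-time resolution is typically written as Jinja inside the source definition rather than as a function. A hedged sketch, assuming a `GCP_PROJECT_ID` environment variable and the `target.type` values used in the pseudocode above (the fallback names `main` and `prod` are illustrative):

```yaml
sources:
  - name: raw
    # Resolve the physical location per target: a BigQuery project in
    # production, a local default database otherwise.
    database: "{{ env_var('GCP_PROJECT_ID') if target.type == 'bigquery' else 'main' }}"
    schema: "{{ 'nytaxi' if target.type == 'bigquery' else 'prod' }}"
    tables:
      - name: green_tripdata
```

Because the Jinja is evaluated at compile time, the staging models themselves contain no environment-specific logic at all.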
This allows a single set of models to operate transparently across cloud warehouses and local development databases.
Freshness Monitoring
Source freshness checks compare the most recent value in a `loaded_at_field` column against the current time:
function check_freshness(source_table, freshness_config):
    max_loaded_at = query("SELECT MAX({loaded_at_field}) FROM {source_table}")
    staleness = current_time() - max_loaded_at
    if staleness > freshness_config.error_after:
        return ERROR    # stale beyond the error threshold
    elif staleness > freshness_config.warn_after:
        return WARN     # stale beyond the warning threshold
    else:
        return PASS
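In dbt, these thresholds are declared on the source itself. A sketch assuming hourly loads and a hypothetical `loaded_timestamp` audit column on the raw tables:

```yaml
sources:
  - name: raw
    loaded_at_field: loaded_timestamp   # hypothetical audit column
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: green_tripdata
```

Running `dbt source freshness` executes the `MAX(loaded_at_field)` query above for each declared source and reports a pass, warn, or error status per table.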
This pattern turns raw data availability into a testable contract, allowing teams to detect upstream failures proactively.
Column Documentation as Contract
Documenting columns at the source level creates a data contract between the ingestion layer and the transformation layer. When a column's type or semantics change upstream, the source documentation serves as the reference for what was originally expected.
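A sketch of such a contract on a source table (the column names are illustrative, drawn from the NYC taxi dataset the zoomcamp uses):

```yaml
sources:
  - name: raw
    tables:
      - name: green_tripdata
        description: "Trip records loaded by the ingestion pipeline."
        columns:
          - name: vendorid
            description: "Code of the provider that supplied the record."
          - name: lpep_pickup_datetime
            description: "Timestamp when the meter was engaged."
```

These descriptions flow into the generated documentation site (`dbt docs generate`), so the expected raw-data contract is visible alongside the transformation lineage.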