Workflow:DataTalksClub Data engineering zoomcamp dbt Analytics Transformation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Analytics_Engineering, Data_Transformation |
| Last Updated | 2026-02-09 07:00 GMT |
Overview
End-to-end analytics engineering workflow using dbt to transform raw NYC taxi trip data through staging, intermediate, and marts layers into analysis-ready reporting tables.
Description
This workflow implements the analytics engineering best practice of layered data transformation using dbt (data build tool). Raw taxi trip data from green and yellow taxi sources is progressively refined through three layers: staging (cleaning and standardization), intermediate (business logic and unioning), and marts (final aggregations for reporting). The project supports dual execution paths: local development with DuckDB and cloud deployment with BigQuery. It includes custom macros, seed data for lookup tables, comprehensive schema tests, and generated documentation.
Usage
Execute this workflow when raw taxi trip data has been loaded into your data warehouse (DuckDB locally or BigQuery in the cloud) and you need to build clean, tested, documented transformation models. Use this after completing the data ingestion workflows (Docker PostgreSQL or Kestra ETL) and before building dashboards or performing analysis.
Execution Steps
Step 1: Project Configuration
Initialize the dbt project with the correct profile, model paths, and materialization strategies. The project configuration defines where dbt looks for models, seeds, macros, and tests. Staging models are materialized as views for efficiency, while intermediate and marts models are materialized as tables for query performance.
Key considerations:
- Profile configuration differs between DuckDB (local) and BigQuery (cloud) targets
- The target.type variable enables conditional logic for cross-database compatibility
- Development variables (dev_start_date, dev_end_date) limit data volume during development
Step 2: Source Declaration
Declare raw data sources in a YAML schema file, mapping the physical tables in the warehouse to logical source references used by dbt models. Source definitions include column descriptions, freshness checks, and database/schema routing that varies by target type.
Key considerations:
- Sources use Jinja conditionals to resolve database and schema names per target
- Freshness checks warn at 24 hours and error at 48 hours of stale data
- Both green_tripdata and yellow_tripdata sources are declared with full column documentation
Step 3: Staging Layer
Build staging models that clean and standardize raw source data. Each source table gets its own staging model that renames columns, casts data types, filters invalid records, and applies consistent naming conventions. Staging models are materialized as views.
Key considerations:
- Filter out records with null vendor IDs and invalid trip distances
- Standardize column names across green and yellow taxi schemas
- Apply the payment_type_description macro to decode numeric payment types
- Date filtering in dev environment limits processing to a configurable window
Step 4: Intermediate Layer
Combine staged green and yellow taxi data into unified trip records. The intermediate layer applies business logic to union the two datasets, adding a service_type discriminator column. This layer resolves schema differences between the two taxi types.
Key considerations:
- Union operation requires matching column sets between green and yellow data
- Service type column enables downstream filtering and grouping
- Materialized as tables for query performance in the marts layer
Step 5: Marts Layer
Build final reporting tables optimized for analytics consumption. Marts models aggregate trip data by time periods and zones, join with lookup data (taxi zone seed tables), and compute revenue metrics. These are the tables that dashboards and analysts query directly.
Key considerations:
- Fact tables join trip data with dimension tables (taxi zones)
- Revenue aggregations group by pickup zone, month, and service type
- Materialized as tables for fast dashboard queries
Step 6: Testing and Documentation
Run dbt tests to validate data quality across all layers. Tests include schema tests (not_null, unique, accepted_values, relationships) and custom data tests. Generate dbt documentation to create a browsable data catalog with lineage graphs.
Key considerations:
- Schema tests are defined in YAML files alongside model definitions
- The dbt_utils package provides additional test types
- Documentation generation creates a static site with DAG visualization
- Run dbt test after dbt build to validate all models