Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:DataTalksClub Data engineering zoomcamp dbt Analytics Transformation

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Analytics_Engineering, Data_Transformation
Last Updated 2026-02-09 07:00 GMT

Overview

End-to-end analytics engineering workflow using dbt to transform raw NYC taxi trip data through staging, intermediate, and marts layers into analysis-ready reporting tables.

Description

This workflow implements the analytics engineering best practice of layered data transformation using dbt (data build tool). Raw taxi trip data from green and yellow taxi sources is progressively refined through three layers: staging (cleaning and standardization), intermediate (business logic and unioning), and marts (final aggregations for reporting). The project supports dual execution paths: local development with DuckDB and cloud deployment with BigQuery. It includes custom macros, seed data for lookup tables, comprehensive schema tests, and generated documentation.

Usage

Execute this workflow when raw taxi trip data has been loaded into your data warehouse (DuckDB locally or BigQuery in the cloud) and you need to build clean, tested, documented transformation models. Use this after completing the data ingestion workflows (Docker PostgreSQL or Kestra ETL) and before building dashboards or performing analysis.

Execution Steps

Step 1: Project Configuration

Initialize the dbt project with the correct profile, model paths, and materialization strategies. The project configuration defines where dbt looks for models, seeds, macros, and tests. Staging models are materialized as views for efficiency, while intermediate and marts models are materialized as tables for query performance.

Key considerations:

  • Profile configuration differs between DuckDB (local) and BigQuery (cloud) targets
  • The target.type variable enables conditional logic for cross-database compatibility
  • Development variables (dev_start_date, dev_end_date) limit data volume during development

Step 2: Source Declaration

Declare raw data sources in a YAML schema file, mapping the physical tables in the warehouse to logical source references used by dbt models. Source definitions include column descriptions, freshness checks, and database/schema routing that varies by target type.

Key considerations:

  • Sources use Jinja conditionals to resolve database and schema names per target
  • Freshness checks warn at 24 hours and error at 48 hours of stale data
  • Both green_tripdata and yellow_tripdata sources are declared with full column documentation

Step 3: Staging Layer

Build staging models that clean and standardize raw source data. Each source table gets its own staging model that renames columns, casts data types, filters invalid records, and applies consistent naming conventions. Staging models are materialized as views.

Key considerations:

  • Filter out records with null vendor IDs and invalid trip distances
  • Standardize column names across green and yellow taxi schemas
  • Apply the payment_type_description macro to decode numeric payment types
  • Date filtering in dev environment limits processing to a configurable window

Step 4: Intermediate Layer

Combine staged green and yellow taxi data into unified trip records. The intermediate layer applies business logic to union the two datasets, adding a service_type discriminator column. This layer resolves schema differences between the two taxi types.

Key considerations:

  • Union operation requires matching column sets between green and yellow data
  • Service type column enables downstream filtering and grouping
  • Materialized as tables for query performance in the marts layer

Step 5: Marts Layer

Build final reporting tables optimized for analytics consumption. Marts models aggregate trip data by time periods and zones, join with lookup data (taxi zone seed tables), and compute revenue metrics. These are the tables that dashboards and analysts query directly.

Key considerations:

  • Fact tables join trip data with dimension tables (taxi zones)
  • Revenue aggregations group by pickup zone, month, and service type
  • Materialized as tables for fast dashboard queries

Step 6: Testing and Documentation

Run dbt tests to validate data quality across all layers. Tests include schema tests (not_null, unique, accepted_values, relationships) and custom data tests. Generate dbt documentation to create a browsable data catalog with lineage graphs.

Key considerations:

  • Schema tests are defined in YAML files alongside model definitions
  • The dbt_utils package provides additional test types
  • Documentation generation creates a static site with DAG visualization
  • Run dbt test after dbt build to validate all models

Execution Diagram

GitHub URL

Workflow Repository