Workflow: Dagster ETL Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ETL, Data_Quality |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
End-to-end process for building a production-grade ETL pipeline using Dagster, DuckDB, dbt, and Evidence BI for data extraction, transformation, quality validation, partitioning, and automated visualization.
Description
This workflow outlines the standard procedure for constructing a complete ETL (Extract, Transform, Load) pipeline orchestrated by Dagster. It demonstrates the core Dagster programming model by ingesting CSV data into a DuckDB analytical database, transforming it with dbt models exposed as Dagster assets via the component system, enforcing data quality through asset checks, implementing incremental processing with time-based partitions, and automating execution with declarative schedules and sensors. The final step connects the pipeline output to an Evidence BI dashboard for visualization.
Usage
Execute this workflow when you need to build a data pipeline that ingests structured data from files or APIs, transforms it through SQL models, validates data quality at each stage, processes data incrementally over time periods, and presents results in a BI dashboard. This is the recommended starting point for teams adopting Dagster for data engineering workloads.
Execution Steps
Step 1: Project Scaffolding
Initialize a new Dagster project using the CLI scaffolding tool. This generates the standard project structure including asset definitions, resource configurations, and development server setup. The scaffolded project provides a working skeleton that can be launched immediately with the development server.
Key considerations:
- Requires Python 3.10+ and uv package manager
- The project template includes pre-configured DuckDB integration
- Use the development server to verify the project loads correctly before adding custom assets
Step 2: Data Extraction
Define Dagster assets that ingest raw data from CSV files into DuckDB tables. Each source table is represented as a separate asset (e.g., raw_customers, raw_orders, raw_payments). The assets use file locking to ensure serial database access and employ CREATE OR REPLACE TABLE statements for idempotent materialization.
Key considerations:
- Each raw data source maps to one Dagster asset
- Use DuckDB's native CSV reading capabilities for efficient ingestion
- Implement file locking when multiple assets share the same database file
- Assets are idempotent: re-materializing replaces the existing table
Step 3: Data Transformation with dbt
Integrate a dbt project into the Dagster pipeline using the DbtProjectComponent. This automatically converts dbt models into Dagster assets with full lineage tracking. The component system uses declarative YAML configuration rather than Python code to define the dbt integration.
Key considerations:
- Use the component scaffolding command to generate the dbt integration boilerplate
- dbt models automatically inherit dependency relationships as Dagster asset edges
- The dbt-duckdb adapter lets dbt transformations run against the same DuckDB database used for raw ingestion
- Lineage flows automatically from raw ingestion assets through dbt transformation assets
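The component is configured in YAML rather than Python. The exact file location and schema vary across Dagster versions, so the snippet below is a sketch: the component type follows the dagster-dbt component system, but the file path, template variable, and dbt project directory name are assumptions:

```yaml
# e.g. defs/transform/defs.yaml -- DbtProjectComponent sketch; the
# path and the "transform" project directory are illustrative.
type: dagster_dbt.DbtProjectComponent
attributes:
  project: "{{ project_root }}/transform"
```

With this in place, every model in the dbt project appears as a Dagster asset, and dbt `ref`/`source` relationships become asset dependency edges automatically.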
Step 4: Resource Configuration
Centralize external dependency management by defining a DuckDB resource. This replaces individual connection management in each asset with a shared, configurable resource injected by Dagster at runtime. Resources ensure consistent configuration across all assets and checks.
Key considerations:
- Define resources in the Definitions object with string keys
- Assets receive resources via parameter name matching
- Resource configuration can vary between development and production environments
- The DuckDB resource handles connection acquisition and lifecycle management
Step 5: Data Quality Checks
Implement asset checks that validate data integrity after each materialization. Checks use SQL queries against DuckDB to verify constraints such as non-null primary keys, referential integrity, and business rule compliance. Each check returns a pass/fail result visible in the Dagster UI.
Key considerations:
- Asset checks run automatically during materialization or on demand via the UI
- Checks access the same DuckDB resource as the assets they validate
- Failed checks do not block downstream execution by default (a check can opt into blocking behavior), but they provide visibility
- Use AssetCheckResult to report pass/fail status with optional metadata
Step 6: Time-Based Partitioning
Add monthly partitions to assets that process data incrementally. Partitioned assets use the partition key to filter data for a specific time window, enabling efficient reprocessing of individual periods without re-materializing the entire dataset. The implementation uses idempotent DELETE/INSERT patterns.
Key considerations:
- MonthlyPartitionsDefinition generates one partition per calendar month
- The partition key is accessible via context.partition_key within the asset function
- Backfill operations can be triggered from the Dagster UI for historical partition ranges
- Idempotent patterns (DELETE then INSERT) ensure partitions can be safely re-materialized
Step 7: Pipeline Automation
Configure declarative automation to schedule and trigger asset materialization. Ingestion assets run on a daily cron schedule using AutomationCondition.on_cron(), while downstream transformation and analytics assets trigger reactively using AutomationCondition.eager() when their upstream dependencies complete.
Key considerations:
- Declarative automation is defined directly on asset definitions, not as separate schedule objects
- Cron-based conditions suit periodic source data refreshes
- Eager conditions propagate materializations through the dependency graph automatically
- The Dagster daemon must be running for automation to execute