Workflow: Dagster ETL Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ETL, Data_Quality |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
End-to-end process for building a production-grade ETL pipeline using Dagster, DuckDB, dbt, and Evidence BI for data extraction, transformation, quality validation, partitioning, and automated visualization.
Description
This workflow outlines the standard procedure for constructing a complete ETL (Extract, Transform, Load) pipeline orchestrated by Dagster. It demonstrates the core Dagster programming model by ingesting CSV data into a DuckDB analytical database, transforming it with dbt models exposed as Dagster assets via the component system, enforcing data quality through asset checks, implementing incremental processing with time-based partitions, and automating execution with declarative schedules and sensors. The final step connects the pipeline output to an Evidence BI dashboard for visualization.
Usage
Execute this workflow when you need to build a data pipeline that ingests structured data from files or APIs, transforms it through SQL models, validates data quality at each stage, processes data incrementally over time periods, and presents results in a BI dashboard. This is the recommended starting point for teams adopting Dagster for data engineering workloads.
Execution Steps
Step 1: Project Scaffolding
Initialize a new Dagster project using the CLI scaffolding tool. This generates the standard project structure including asset definitions, resource configurations, and development server setup. The scaffolded project provides a working skeleton that can be launched immediately with the development server.
Key considerations:
- Requires Python 3.10+ and uv package manager
- The project template includes pre-configured DuckDB integration
- Use the development server to verify the project loads correctly before adding custom assets
Step 2: Data Extraction
Define Dagster assets that ingest raw data from CSV files into DuckDB tables. Each source table is represented as a separate asset (e.g., raw_customers, raw_orders, raw_payments). The assets use file locking to ensure serial database access and employ CREATE OR REPLACE TABLE statements for idempotent materialization.
Key considerations:
- Each raw data source maps to one Dagster asset
- Use DuckDB's native CSV reading capabilities for efficient ingestion
- Implement file locking when multiple assets share the same database file
- Assets are idempotent: re-materializing replaces the existing table
Step 3: Data Transformation with dbt
Integrate a dbt project into the Dagster pipeline using the DbtProjectComponent. This automatically converts dbt models into Dagster assets with full lineage tracking. The component system uses declarative YAML configuration rather than Python code to define the dbt integration.
Key considerations:
- Use the component scaffolding command to generate the dbt integration boilerplate
- dbt models automatically inherit dependency relationships as Dagster asset edges
- The dbt-duckdb adapter lets dbt transformations run against the same DuckDB database used for raw ingestion
- Lineage flows automatically from raw ingestion assets through dbt transformation assets
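The component is configured in YAML rather than Python. The exact file location and schema vary across Dagster versions, so the snippet below is a sketch: the component type follows the dagster-dbt component system, but the file path, template variable, and dbt project directory name are assumptions:

```yaml
# e.g. defs/transform/defs.yaml -- DbtProjectComponent sketch; the
# path and the "transform" project directory are illustrative.
type: dagster_dbt.DbtProjectComponent
attributes:
  project: "{{ project_root }}/transform"
```

With this in place, every model in the dbt project appears as a Dagster asset, and dbt `ref`/`source` relationships become asset dependency edges automatically.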
Step 4: Resource Configuration
Centralize external dependency management by defining a DuckDB resource. This replaces individual connection management in each asset with a shared, configurable resource injected by Dagster at runtime. Resources ensure consistent configuration across all assets and checks.
Key considerations:
- Define resources in the Definitions object with string keys
- Assets receive resources via parameter name matching
- Resource configuration can vary between development and production environments
- The DuckDB resource handles connection acquisition and lifecycle management
Step 5: Data Quality Checks
Implement asset checks that validate data integrity after each materialization. Checks use SQL queries against DuckDB to verify constraints such as non-null primary keys, referential integrity, and business rule compliance. Each check returns a pass/fail result visible in the Dagster UI.
Key considerations:
- Asset checks run automatically during materialization or on demand via the UI
- Checks access the same DuckDB resource as the assets they validate
- Failed checks do not block downstream execution by default (a check can opt into blocking behavior), but they provide visibility
- Use AssetCheckResult to report pass/fail status with optional metadata
Step 6: Time-Based Partitioning
Add monthly partitions to assets that process data incrementally. Partitioned assets use the partition key to filter data for a specific time window, enabling efficient reprocessing of individual periods without re-materializing the entire dataset. The implementation uses idempotent DELETE/INSERT patterns.
Key considerations:
- MonthlyPartitionsDefinition generates one partition per calendar month
- The partition key is accessible via context.partition_key within the asset function
- Backfill operations can be triggered from the Dagster UI for historical partition ranges
- Idempotent patterns (DELETE then INSERT) ensure partitions can be safely re-materialized
Step 7: Pipeline Automation
Configure declarative automation to schedule and trigger asset materialization. Ingestion assets run on a daily cron schedule using AutomationCondition.on_cron(), while downstream transformation and analytics assets trigger reactively using AutomationCondition.eager() when their upstream dependencies complete.
Key considerations:
- Declarative automation is defined directly on asset definitions, not as separate schedule objects
- Cron-based conditions suit periodic source data refreshes
- Eager conditions propagate materializations through the dependency graph automatically
- The Dagster daemon must be running for automation to execute