Principle: Dagster Software-Defined Assets
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Orchestration |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Software-defined assets are the core abstraction in Dagster, representing data artifacts (tables, files, ML models) as first-class objects with known dependencies, computation functions, and metadata.
Description
Software-defined assets combine an asset key (identity), a computation function, and upstream dependencies into a single declarative unit. Unlike traditional task-based orchestration, assets declare what data they produce rather than what steps to run. Dependencies between assets are inferred from function parameters, enabling the framework to automatically construct the execution graph.
Each software-defined asset encapsulates three core elements:
- Asset Key: A unique identifier for the data artifact (e.g., a database table name, file path, or logical data product name).
- Computation Function: The Python function that produces or updates the asset when executed.
- Upstream Dependencies: References to other assets that must be materialized before this asset can be computed.
This combination allows Dagster to provide automatic lineage tracking, incremental computation, and declarative automation without requiring users to manually wire together execution steps.
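The three elements can be modeled in plain Python. The following is a conceptual sketch only, not Dagster's actual API; the class and field names are illustrative:

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass(frozen=True)
class AssetSpec:
    """Conceptual model of a software-defined asset (illustrative, not Dagster's class)."""
    key: str                       # Asset Key: unique identity of the data artifact
    compute: Callable[..., Any]    # Computation Function: produces the asset
    deps: Tuple[str, ...] = ()     # Upstream Dependencies: asset keys required first

# Example: a summary derived from a raw table
raw = AssetSpec(key="raw_data", compute=lambda: [1, 2, 3])
summary = AssetSpec(key="summary", compute=lambda rows: sum(rows), deps=("raw_data",))
```

Because identity, computation, and dependencies live in one declarative unit, an orchestrator can walk the `deps` edges to build the execution graph without any explicit task wiring.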
Usage
Use software-defined assets when modeling any data pipeline where outputs have meaningful identity. This includes database tables, files in object storage, ML models, feature tables, and any other data artifact that is produced by computation and consumed by downstream processes. Software-defined assets are the fundamental building block of all Dagster pipelines and should be the default choice for representing data transformations.
Theoretical Basis
The asset-centric model inverts the traditional DAG-of-tasks paradigm. In a task-based system, users define a directed acyclic graph of operations (Extract, Transform, Load) and wire them together explicitly. In an asset-centric system, users define data products and their dependencies, and the orchestrator infers the execution plan from the declared asset graph.
This inversion provides several theoretical advantages:
- Declarative Semantics: The pipeline specification describes the desired state of data rather than the procedure to achieve it.
- Automatic Lineage: Because dependencies are declared at the data level, the system can trace the full provenance of any asset.
- Incremental Computation: The framework can determine which assets need re-computation based on upstream changes, avoiding unnecessary work.
- Idempotency: Asset materializations are designed to be idempotent: re-running the same asset with the same inputs produces the same output.
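The incremental-computation advantage can be sketched with a plain-Python reachability check over the declared asset graph. This is an illustrative simplification; Dagster's own staleness logic is richer, and the function and graph below are hypothetical:

```python
from collections import defaultdict, deque

def stale_assets(deps, changed):
    """Return all assets downstream of `changed` that need recomputation.

    `deps` maps each asset key to the list of asset keys it depends on.
    """
    downstream = defaultdict(list)            # invert the dependency edges
    for asset_key, upstreams in deps.items():
        for up in upstreams:
            downstream[up].append(asset_key)
    stale, queue = set(), deque([changed])
    while queue:                              # breadth-first walk of descendants
        for child in downstream[queue.popleft()]:
            if child not in stale:
                stale.add(child)
                queue.append(child)
    return stale

graph = {"raw_data": [], "clean_data": ["raw_data"], "summary": ["clean_data"]}
```

With this graph, changing `raw_data` marks both `clean_data` and `summary` stale, while changing `summary` marks nothing else: only work that is actually downstream of a change is re-done.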
The following sketch illustrates the contrast. The asset definitions use Dagster's `@asset` decorator; the task-based line is Airflow-flavored pseudocode:

```python
# Traditional task-based approach (explicit wiring of steps):
# task_extract >> task_transform >> task_load

# Asset-centric approach: declare data products and their dependencies
from dagster import asset

@asset
def raw_data():              # declares what is produced
    ...

@asset(deps=[raw_data])      # declares dependency on raw_data
def clean_data():
    ...

@asset(deps=[clean_data])    # declares dependency on clean_data
def summary():
    ...

# Execution order (raw_data -> clean_data -> summary) is inferred automatically
```