Principle:Astronomer Astronomer cosmos Graph Parsing and Task Generation

From Leeroopedia


Overview

A core orchestration principle for parsing a dbt project's dependency graph and generating corresponding orchestration tasks with correct dependency wiring. This two-phase process is the engine behind all Cosmos rendering patterns.

Description

The graph parsing and task generation principle operates in two distinct phases:

Phase 1: Graph Loading

The first phase discovers the dbt project's node structure and dependency relationships. Cosmos supports multiple loading strategies, selected via the LoadMode enum:

  • AUTOMATIC -- Cosmos picks a method based on what is available: it uses a manifest if one is provided, otherwise tries dbt ls, and falls back to its own project parser.
  • DBT_LS -- runs dbt ls as a subprocess to discover nodes. This is the most accurate method but requires a dbt installation and database connectivity at parse time.
  • DBT_MANIFEST -- parses a pre-built manifest.json file. Fast and does not require dbt at parse time, but the manifest must be kept in sync with the project.
  • DBT_LS_FILE -- reads a previously saved dbt ls output from a file.
  • DBT_LS_CACHE -- uses a cached dbt ls result, refreshing periodically.
  • CUSTOM -- uses Cosmos's own parser to read the project files directly. It needs no dbt installation at parse time but is less accurate than dbt ls.

Each strategy produces the same output: a dictionary of DbtNode objects keyed by unique ID, along with their dependency relationships.
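The shared output shape can be illustrated with a minimal sketch of manifest-based loading. It is not the Cosmos implementation: the Node class and load_nodes_from_manifest function are hypothetical stand-ins, and the sample manifest is a hand-written fragment that keeps only the resource_type and depends_on.nodes fields found in a real manifest.json.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a Cosmos DbtNode (not the real class)."""
    unique_id: str
    resource_type: str
    depends_on: list = field(default_factory=list)


def load_nodes_from_manifest(manifest_text: str) -> dict:
    """Build a dict of Node objects keyed by unique ID from manifest JSON text."""
    manifest = json.loads(manifest_text)
    nodes = {}
    for unique_id, raw in manifest.get("nodes", {}).items():
        nodes[unique_id] = Node(
            unique_id=unique_id,
            resource_type=raw["resource_type"],
            # dbt records parent node IDs under depends_on.nodes
            depends_on=list(raw.get("depends_on", {}).get("nodes", [])),
        )
    return nodes


# A tiny hand-written manifest fragment, for illustration only.
SAMPLE_MANIFEST = json.dumps({
    "nodes": {
        "seed.shop.raw_orders": {
            "resource_type": "seed",
            "depends_on": {"nodes": []},
        },
        "model.shop.stg_orders": {
            "resource_type": "model",
            "depends_on": {"nodes": ["seed.shop.raw_orders"]},
        },
        "model.shop.orders": {
            "resource_type": "model",
            "depends_on": {"nodes": ["model.shop.stg_orders"]},
        },
    }
})

nodes = load_nodes_from_manifest(SAMPLE_MANIFEST)
```

Any other loading strategy would only need to return the same kind of dictionary for the second phase to work unchanged.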

Phase 2: Task Generation

The second phase maps each discovered dbt node to an Airflow operator and wires task dependencies:

  1. Node filtering -- nodes are filtered based on the RenderConfig (select, exclude, resource types).
  2. Operator selection -- each node's resource type (model, test, seed, snapshot) determines which operator class is used. The execution mode (local, Docker, Kubernetes, etc.) further refines the operator choice.
  3. Task instantiation -- operators are created with the appropriate arguments, including any shared settings passed to every operator through operator_args.
  4. Dependency wiring -- upstream/downstream relationships from the dbt graph are translated into Airflow task dependencies using the >> operator.

The result is a set of Airflow tasks with dependencies that mirror the dbt project's DAG.
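The four steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Cosmos code: Task is a hypothetical stand-in for an Airflow operator that only records upstream task IDs, the operator names in the lookup table are placeholders, and filtering is reduced to a set of allowed resource types.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed dbt node."""
    unique_id: str
    resource_type: str
    depends_on: list = field(default_factory=list)


class Task:
    """Minimal stand-in for an Airflow operator that supports `>>`."""

    def __init__(self, task_id, operator_name):
        self.task_id = task_id
        self.operator_name = operator_name
        self.upstream = set()

    def __rshift__(self, other):
        # `a >> b` records that b runs after a, then returns b for chaining.
        other.upstream.add(self.task_id)
        return other


# Placeholder operator names; the real classes depend on the execution mode.
OPERATOR_BY_RESOURCE_TYPE = {
    "model": "RunOperator",
    "seed": "SeedOperator",
    "snapshot": "SnapshotOperator",
    "test": "TestOperator",
}


def build_tasks(nodes, include_types):
    # 1. Node filtering: keep only the requested resource types.
    selected = {
        uid: node for uid, node in nodes.items()
        if node.resource_type in include_types
    }
    # 2 and 3. Operator selection and task instantiation.
    tasks = {
        uid: Task(uid.split(".")[-1], OPERATOR_BY_RESOURCE_TYPE[node.resource_type])
        for uid, node in selected.items()
    }
    # 4. Dependency wiring: translate dbt edges into `>>` relationships.
    for uid, node in selected.items():
        for parent_id in node.depends_on:
            if parent_id in tasks:  # simplification: drop edges to filtered-out nodes
                tasks[parent_id] >> tasks[uid]
    return tasks


nodes = {
    "seed.shop.raw_orders": Node("seed.shop.raw_orders", "seed"),
    "model.shop.stg_orders": Node(
        "model.shop.stg_orders", "model", ["seed.shop.raw_orders"]
    ),
    "model.shop.orders": Node(
        "model.shop.orders", "model", ["model.shop.stg_orders"]
    ),
}

all_tasks = build_tasks(nodes, {"seed", "model"})
models_only = build_tasks(nodes, {"model"})
```

With both resource types included, stg_orders depends on raw_orders; with models only, the seed and its edge disappear while the model-to-model edge is kept.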

Usage

This two-phase process happens automatically inside DbtDag and DbtTaskGroup. Understanding it is critical for:

  • Debugging graph loading issues -- if tasks are missing or incorrectly ordered, the problem usually lies in Phase 1 (wrong load mode, stale manifest, missing dbt connectivity).
  • Customizing node-to-operator mapping -- advanced users can influence which operators are chosen by adjusting the execution mode or by supplying per-resource-type converter functions through the node_converters option of RenderConfig.
  • Performance tuning -- choosing the right LoadMode affects DAG parse time. DBT_MANIFEST is typically the fastest because it only reads a file; DBT_LS is the most accurate but typically the slowest because it runs a dbt subprocess.
  • Understanding filtering -- the RenderConfig controls which dbt nodes become Airflow tasks via select and exclude parameters.

Theoretical Basis

dbt projects define a directed acyclic graph (DAG) of nodes:

  • Nodes include models, tests, seeds, snapshots, and sources.
  • Edges represent data dependencies declared via ref() and source() macros.

The graph parsing and task generation principle performs a topological mapping from this dbt graph to an Airflow task graph:

  • Each dbt node is mapped to an Airflow operator (preserving node identity).
  • Each dbt edge is mapped to an Airflow task dependency (preserving execution order).
  • The mapping is structure-preserving -- the Airflow task graph is isomorphic to the (filtered) dbt node graph.

This topological mapping guarantees that:

  • No task runs before its dependencies -- the Airflow scheduler enforces the same ordering constraints as dbt.
  • Maximum parallelism -- independent branches of the dbt graph can execute concurrently, subject to Airflow's pool and concurrency limits.
  • Incremental retries -- a failed model can be retried without re-running its upstream dependencies.
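The ordering and parallelism guarantees can be illustrated with Python's standard graphlib module. This sketch is not how the Airflow scheduler is implemented; it only shows that a dependency graph determines which nodes are ready to run together. The node names are invented for the example.

```python
from graphlib import TopologicalSorter

# Maps each node to the set of nodes it depends on (its predecessors).
dependencies = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "orders": {"stg_orders", "stg_customers"},
}


def parallel_batches(deps):
    """Group nodes into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # marking nodes done unlocks their dependents
    return batches


batches = parallel_batches(dependencies)
for index, batch in enumerate(batches):
    print(index, batch)
```

Each batch contains nodes that could run concurrently, and no node appears before everything it depends on. The two staging models land in the same batch because they sit on independent branches.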

The two-phase design separates discovery (what nodes exist) from construction (how to build tasks), enabling different loading strategies without changing the task generation logic.
