Heuristic: Astronomer Cosmos Cache Strategy Optimization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Performance |
| Last Updated | 2026-02-07 17:00 GMT |
Overview
Multi-layer caching architecture that caches dbt ls output, YAML selectors, partial parse artifacts, profiles, and package lockfiles to minimize DAG parse time.
Description
Cosmos implements a five-layer caching strategy to avoid redundant dbt project parsing during Airflow DAG scheduling. Each cache layer targets a different bottleneck in the dbt-to-Airflow rendering pipeline:
- dbt ls cache: Caches the output of `dbt ls` commands to avoid re-running them on every DAG parse cycle
- YAML selector cache: Caches parsed YAML selector configurations
- Partial parse cache: Preserves dbt's incremental `partial_parse.msgpack` file between runs
- Profile cache: Caches generated dbt profiles to avoid re-creating them
- Package lockfile cache: Caches `package-lock.yml` to skip `dbt deps` when packages haven't changed
Cache identifiers are based on the DAG/TaskGroup location (not dbt project path) to prevent concurrency issues when multiple DAGs share the same dbt project on the same node. Cache invalidation uses file hashing (MD5 of sorted JSON) to detect project changes, with measured overhead of 0.01s for small projects and 0.135s for 5,000-model projects.
Usage
Use this heuristic when optimizing Cosmos DAG parse performance. All caches are enabled by default. Disable individual layers only when debugging stale-state issues or when running in environments with shared filesystem concerns.
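As a sketch of how the layers can be toggled, Cosmos reads its settings from the Airflow config, which can be supplied as environment variables. The specific option names below are assumptions based on Cosmos' settings module and may differ between versions, so check the docs for your installed release:

```python
import os

# Hypothetical sketch: toggling Cosmos cache layers via Airflow-style
# environment variables. The option names are assumptions -- verify them
# against your Cosmos version before relying on them.
os.environ["AIRFLOW__COSMOS__ENABLE_CACHE"] = "True"                  # master switch
os.environ["AIRFLOW__COSMOS__ENABLE_CACHE_DBT_LS"] = "True"           # dbt ls layer
os.environ["AIRFLOW__COSMOS__ENABLE_CACHE_PARTIAL_PARSE"] = "True"    # partial parse layer
os.environ["AIRFLOW__COSMOS__ENABLE_CACHE_PROFILE"] = "True"          # profile layer
os.environ["AIRFLOW__COSMOS__ENABLE_CACHE_PACKAGE_LOCKFILE"] = "True" # lockfile layer
```

Setting an individual layer to `"False"` while debugging stale-state issues narrows down which cache is serving outdated data.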
The Insight (Rule of Thumb)
- Action: Keep all caches enabled (defaults). Set `enable_cache = True` globally.
- Value: Default TTL is 30 days since last DAG execution.
- Trade-off: Stale cache can cause missed model changes. Cache cleanup requires explicit maintenance (via a cleanup DAG or manual deletion).
- Cleanup: Use `delete_unused_dbt_ls_cache()` in a maintenance DAG to prune caches for DAGs that are no longer active.
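The pruning logic can be illustrated with a self-contained stand-in for `delete_unused_dbt_ls_cache()`: walk the cache root and delete any per-DAG cache directory that has not been touched within the TTL. The cache root layout and the helper name here are assumptions for illustration, not Cosmos' actual implementation:

```python
import shutil
import time
from pathlib import Path

# Hypothetical sketch of TTL-based cache pruning. The 30-day default
# mirrors the TTL described above; the flat one-directory-per-DAG layout
# is an assumption made for this example.
def prune_stale_caches(cache_root: Path, ttl_seconds: float = 30 * 24 * 3600) -> list[Path]:
    removed = []
    now = time.time()
    for entry in cache_root.iterdir():
        # A directory untouched for longer than the TTL belongs to a DAG
        # that has not executed recently, so it is safe to reclaim.
        if entry.is_dir() and now - entry.stat().st_mtime > ttl_seconds:
            shutil.rmtree(entry)
            removed.append(entry)
    return removed
```

In practice this would run as a task in a scheduled maintenance DAG rather than inline, so cleanup cost never lands on the scheduler's parse path.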
Reasoning
DAG parsing happens frequently in Airflow (every `min_file_process_interval` seconds). Without caching, each parse cycle runs `dbt ls` (which loads the full dbt project), parses selectors, generates profiles, and resolves dependencies. For large dbt projects (1000+ models), this can add 30+ seconds per parse cycle, causing scheduler lag.
The cache identifier design intentionally uses the DAG/TaskGroup location rather than the dbt project path:
```python
# From cosmos/cache.py:110-114
# It was considered to create a cache identifier based on the dbt project path, as opposed
# to where it is used in Airflow. However, we could have concurrency issues if the same
# dbt cached directory was being used by different dbt task groups or DAGs within the same
# node. For this reason, as a starting point, the cache is identified by where it is used.
```
Performance measurement from cosmos/cache.py:294:
```python
# This is fast (e.g. 0.01s for jaffle shop, 0.135s for a 5k models dbt folder)
```
Cache key generation from cosmos/cache.py:310-332:
```python
@functools.lru_cache
def was_project_modified(project_dir: Path, cache_dir: Path) -> bool:
    ...
    current_hash = calculate_directory_hash(project_dir)
    cached_hash = read_cached_hash(cache_dir)
    return current_hash != cached_hash
```
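The "MD5 of sorted JSON" scheme mentioned above can be sketched as follows. This is an illustration of the idea, not Cosmos' exact implementation: hash each file's contents, serialize the sorted `{path: digest}` map as JSON, and fingerprint that single string, so any file addition, removal, or edit flips the final digest:

```python
import hashlib
import json
from pathlib import Path

# Minimal sketch of change detection via MD5 over sorted JSON.
# Not Cosmos' actual code -- an assumption-based illustration of the
# hashing strategy described in this article.
def calculate_directory_hash(project_dir: Path) -> str:
    # Map each file's relative path to the MD5 of its contents.
    file_hashes = {
        str(p.relative_to(project_dir)): hashlib.md5(p.read_bytes()).hexdigest()
        for p in sorted(project_dir.rglob("*"))
        if p.is_file()
    }
    # Sorted JSON gives a canonical serialization, so the outer digest is
    # stable across runs and platforms when nothing has changed.
    payload = json.dumps(file_hashes, sort_keys=True)
    return hashlib.md5(payload.encode()).hexdigest()
```

Because the outer digest is computed over per-file digests rather than raw file contents concatenated together, the cost stays low even for large projects, consistent with the 0.135s measurement for a 5,000-model folder.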