Heuristic: Astronomer Cosmos Cache Strategy Optimization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Performance |
| Last Updated | 2026-02-07 17:00 GMT |
Overview
Multi-layer caching architecture that caches dbt ls output, YAML selectors, partial parse artifacts, profiles, and package lockfiles to minimize DAG parse time.
Description
Cosmos implements a five-layer caching strategy to avoid redundant dbt project parsing during Airflow DAG scheduling. Each cache layer targets a different bottleneck in the dbt-to-Airflow rendering pipeline:
- dbt ls cache: Caches the output of `dbt ls` commands to avoid re-running them on every DAG parse cycle
- YAML selector cache: Caches parsed YAML selector configurations
- Partial parse cache: Preserves dbt's incremental `partial_parse.msgpack` file between runs
- Profile cache: Caches generated dbt profiles to avoid re-creating them
- Package lockfile cache: Caches `package-lock.yml` to skip `dbt deps` when packages haven't changed
Cache identifiers are based on the DAG/TaskGroup location (not dbt project path) to prevent concurrency issues when multiple DAGs share the same dbt project on the same node. Cache invalidation uses file hashing (MD5 of sorted JSON) to detect project changes, with measured overhead of 0.01s for small projects and 0.135s for 5,000-model projects.
Usage
Use this heuristic when optimizing Cosmos DAG parse performance. All caches are enabled by default. Disable individual layers only when debugging stale-state issues or when running in environments with shared filesystem concerns.
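As a sketch of how the layers can be toggled, Cosmos reads its settings from the Airflow config, which can be supplied as environment variables. The specific option names below are assumptions based on Cosmos' settings module and may differ between versions, so check the docs for your installed release:

```python
import os

# Hypothetical sketch: toggling Cosmos cache layers via Airflow-style
# environment variables. The option names are assumptions -- verify them
# against your Cosmos version before relying on them.
os.environ["AIRFLOW__COSMOS__ENABLE_CACHE"] = "True"                  # master switch
os.environ["AIRFLOW__COSMOS__ENABLE_CACHE_DBT_LS"] = "True"           # dbt ls layer
os.environ["AIRFLOW__COSMOS__ENABLE_CACHE_PARTIAL_PARSE"] = "True"    # partial parse layer
os.environ["AIRFLOW__COSMOS__ENABLE_CACHE_PROFILE"] = "True"          # profile layer
os.environ["AIRFLOW__COSMOS__ENABLE_CACHE_PACKAGE_LOCKFILE"] = "True" # lockfile layer
```

Setting an individual layer to `"False"` while debugging stale-state issues narrows down which cache is serving outdated data.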
The Insight (Rule of Thumb)
- Action: Keep all caches enabled (defaults). Set `enable_cache = True` globally.
- Value: Default TTL is 30 days since last DAG execution.
- Trade-off: Stale cache can cause missed model changes. Cache cleanup requires explicit maintenance (via a cleanup DAG or manual deletion).
- Cleanup: Use `delete_unused_dbt_ls_cache()` in a maintenance DAG to prune caches for DAGs that are no longer active.
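The pruning logic can be illustrated with a self-contained stand-in for `delete_unused_dbt_ls_cache()`: walk the cache root and delete any per-DAG cache directory that has not been touched within the TTL. The cache root layout and the helper name here are assumptions for illustration, not Cosmos' actual implementation:

```python
import shutil
import time
from pathlib import Path

# Hypothetical sketch of TTL-based cache pruning. The 30-day default
# mirrors the TTL described above; the flat one-directory-per-DAG layout
# is an assumption made for this example.
def prune_stale_caches(cache_root: Path, ttl_seconds: float = 30 * 24 * 3600) -> list[Path]:
    removed = []
    now = time.time()
    for entry in cache_root.iterdir():
        # A directory untouched for longer than the TTL belongs to a DAG
        # that has not executed recently, so it is safe to reclaim.
        if entry.is_dir() and now - entry.stat().st_mtime > ttl_seconds:
            shutil.rmtree(entry)
            removed.append(entry)
    return removed
```

In practice this would run as a task in a scheduled maintenance DAG rather than inline, so cleanup cost never lands on the scheduler's parse path.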
Reasoning
DAG parsing happens frequently in Airflow (every `min_file_process_interval` seconds). Without caching, each parse cycle runs `dbt ls` (which loads the full dbt project), parses selectors, generates profiles, and resolves dependencies. For large dbt projects (1000+ models), this can add 30+ seconds per parse cycle, causing scheduler lag.
The cache identifier design intentionally uses the DAG/TaskGroup location rather than the dbt project path:
```python
# From cosmos/cache.py:110-114
# It was considered to create a cache identifier based on the dbt project path, as opposed
# to where it is used in Airflow. However, we could have concurrency issues if the same
# dbt cached directory was being used by different dbt task groups or DAGs within the same
# node. For this reason, as a starting point, the cache is identified by where it is used.
```
Performance measurement from cosmos/cache.py:294:
```python
# This is fast (e.g. 0.01s for jaffle shop, 0.135s for a 5k models dbt folder)
```
Cache key generation from cosmos/cache.py:310-332:
```python
@functools.lru_cache
def was_project_modified(project_dir: Path, cache_dir: Path) -> bool:
    ...
    current_hash = calculate_directory_hash(project_dir)
    cached_hash = read_cached_hash(cache_dir)
    return current_hash != cached_hash
```
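The "MD5 of sorted JSON" scheme mentioned above can be sketched as follows. This is an illustration of the idea, not Cosmos' exact implementation: hash each file's contents, serialize the sorted `{path: digest}` map as JSON, and fingerprint that single string, so any file addition, removal, or edit flips the final digest:

```python
import hashlib
import json
from pathlib import Path

# Minimal sketch of change detection via MD5 over sorted JSON.
# Not Cosmos' actual code -- an assumption-based illustration of the
# hashing strategy described in this article.
def calculate_directory_hash(project_dir: Path) -> str:
    # Map each file's relative path to the MD5 of its contents.
    file_hashes = {
        str(p.relative_to(project_dir)): hashlib.md5(p.read_bytes()).hexdigest()
        for p in sorted(project_dir.rglob("*"))
        if p.is_file()
    }
    # Sorted JSON gives a canonical serialization, so the outer digest is
    # stable across runs and platforms when nothing has changed.
    payload = json.dumps(file_hashes, sort_keys=True)
    return hashlib.md5(payload.encode()).hexdigest()
```

Because the outer digest is computed over per-file digests rather than raw file contents concatenated together, the cost stays low even for large projects, consistent with the 0.135s measurement for a 5,000-model folder.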