Principle: Apache Airflow DAG File Discovery
| Knowledge Sources | |
|---|---|
| Domains | DAG_Processing, Scheduling |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A continuous file discovery and parsing process that transforms DAG source files into scheduler-consumable representations.
Description
DAG File Discovery is the scheduler-side perspective of how DAG files are found and loaded. Unlike DAG Deployment (which focuses on the dag-processor component writing to the database), this principle covers how the scheduler discovers available DAGs, tracks file modifications, manages parsing parallelism, and handles stale DAG cleanup. The DagFileProcessorManager coordinates this process with configurable parallelism, timeouts, and re-parse intervals.
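The parallelism, timeout, and re-parse knobs mentioned above correspond to Airflow configuration options. A sketch of a typical setup (values are illustrative; in Airflow 2.x these options live under `[scheduler]` and `[core]`, and newer releases consolidate most of them into a `[dag_processor]` section, so check the configuration reference for your version):

```ini
[scheduler]
# Number of parallel DAG-parsing worker processes
parsing_processes = 2
# Minimum seconds before the same file is re-parsed
min_file_process_interval = 30
# Seconds after which a DAG no longer seen during parsing is deactivated
stale_dag_threshold = 50

[core]
# Kill a single file's parse if it runs longer than this many seconds
dag_file_processor_timeout = 50
```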
Usage
This principle applies in the context of scheduler operation. The scheduler relies on the dag-processor to continuously discover and parse DAG files, making them available for scheduling decisions. Understanding this process is essential for troubleshooting DAG visibility issues and parsing delays.
Theoretical Basis
Discovery Loop:
- Bundle Scan: Enumerate files in all configured DAG bundles
- Filter: Apply safe_mode filtering and .airflowignore rules
- Priority Queue: Order files by last parse time and modification status
- Parallel Dispatch: Send files to worker processes (bounded by _parallelism)
- Result Collection: Gather parsed DAGs and import errors via I/O multiplexing
- Stale Cleanup: Remove DAGs not seen within stale_dag_threshold
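The steps above can be sketched as one discovery cycle. This is a toy model, not Airflow's implementation: `discover_once`, its arguments, and the in-memory dicts are hypothetical stand-ins for the DagFileProcessorManager's state. The `safe_mode` check mirrors Airflow's heuristic of only parsing files whose text mentions both "dag" and "airflow".

```python
import fnmatch
import heapq


def might_contain_dag(source: str, safe_mode: bool = True) -> bool:
    """safe_mode heuristic: skip files mentioning neither 'dag' nor 'airflow'."""
    if not safe_mode:
        return True
    text = source.lower()
    return "dag" in text and "airflow" in text


def discover_once(files, ignore_patterns, last_parsed, now,
                  stale_threshold, parallelism, safe_mode=True):
    """One pass of the discovery loop over {path: source} in a bundle."""
    # Steps 1-2: bundle scan, then .airflowignore globs and safe_mode filtering.
    candidates = [
        path for path, source in files.items()
        if not any(fnmatch.fnmatch(path, pat) for pat in ignore_patterns)
        and might_contain_dag(source, safe_mode)
    ]
    # Step 3: priority queue, least-recently-parsed first (never parsed => 0.0).
    heap = [(last_parsed.get(p, 0.0), p) for p in candidates]
    heapq.heapify(heap)
    # Step 4: bounded dispatch, at most `parallelism` files per cycle.
    batch = [heapq.heappop(heap)[1] for _ in range(min(parallelism, len(heap)))]
    for path in batch:
        last_parsed[path] = now  # stand-in for parsing in a worker process
    # Step 6: stale cleanup, forget files not seen within stale_threshold.
    stale = [p for p, t in last_parsed.items() if now - t > stale_threshold]
    for p in stale:
        del last_parsed[p]
    return batch, stale
```

Result collection via I/O multiplexing (step 5) is elided here; the real manager reads parse results and import errors back from worker processes over pipes.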