Heuristic: Apache Airflow DAG Top-Level Code Avoidance
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Scheduling |
| Last Updated | 2026-02-08 20:00 GMT |
Overview
Avoid placing expensive operations (database calls, network requests, heavy imports) at DAG file top-level to prevent scheduler parsing bottlenecks.
Description
The Airflow scheduler periodically re-parses every DAG file at an interval defined by `min_file_process_interval`. Any Python code at the top level of a DAG file (outside of operator callables) executes on every single parse cycle. This means that database queries, API calls, heavy library imports (pandas, tensorflow, torch), and any expensive computation at the top level adds directly to the scheduler's parse time on every cycle. A single `time.sleep(1000)` at the top level adds 1000 seconds to each parse cycle for that file.
Usage
Apply this heuristic when you observe slow DAG parsing times, scheduler lag, or high CPU usage on the scheduler process. The quickest diagnostic is to run `time python your-dag-file.py` from the command line — if it takes more than a few seconds, top-level code is the likely culprit.
The Insight (Rule of Thumb)
- Action: Move all expensive operations inside operator callables (e.g., inside `python_callable` functions or `execute()` methods). Keep top-level code limited to DAG/task definitions.
- Value: DAG parsing time should ideally be under 1 second per file. Aim for the minimal import footprint.
- Trade-off: May require restructuring DAG code to use lazy imports or Jinja templates instead of direct Python calls.
- Diagnostic: Run `time python your-dag-file.py` — subtract ~0.07s Python startup overhead to get actual parsing time.
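The subtraction in the diagnostic can be sketched as a tiny helper (the function name is hypothetical; the ~0.07 s startup baseline is the figure quoted above and varies by machine and Python version):

```python
# Hypothetical helper: estimate the parse time attributable to the DAG file
# itself by subtracting Python interpreter startup overhead from a measured
# `time python your-dag-file.py` run.
def effective_parse_time(measured_s: float, startup_overhead_s: float = 0.07) -> float:
    """Return measured wall-clock time minus interpreter startup, floored at 0."""
    return max(0.0, measured_s - startup_overhead_s)

# A file measured at 2.57 s really costs ~2.5 s of scheduler parse time —
# well over the 1-second target above.
print(round(effective_parse_time(2.57), 2))  # 2.5
```

Anything consistently above ~1 s after the subtraction is worth investigating for top-level work.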
What to AVOID at top level:
- Database access (`Variable.get()`, `Connection.get_connection_from_secrets()`, direct SQL)
- Networking operations (API calls, HTTP requests)
- Heavy imports (`import pandas`, `import tensorflow`)
- Expensive computations (data processing, file parsing)
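For example, a `Variable.get()` at module level hits the metadata database on every parse cycle; deferring it to runtime keeps the parse cheap. A sketch (the callable and variable names are hypothetical; the BAD pattern is shown commented out so the module stays importable):

```python
# BAD - resolved on every parse cycle, hitting the metadata DB each time:
# from airflow.models import Variable
# API_KEY = Variable.get("api_key")

# GOOD - resolved only when the task actually runs.
def call_api(**context):
    from airflow.models import Variable  # lazy import keeps module parse cheap
    api_key = Variable.get("api_key")    # DB hit deferred to task runtime
    print(f"calling API with key of length {len(api_key)}")
```

Alternatively, reference the variable through a Jinja template (`{{ var.value.api_key }}`) in a templated operator field, which likewise defers resolution to task runtime.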
What is SAFE at top level:
- DAG instantiation (`with DAG(...)`)
- Task/operator definitions
- Simple variable assignments
- Standard library imports
Reasoning
The scheduler re-parses DAG files in a loop governed by `min_file_process_interval` (default: 30 seconds). With hundreds of DAG files, the combined parse time for all files must fit within this interval or scheduling lags behind. Because each file's top-level code runs on every parse, any slowdown is multiplied across cycles. Parsing is spread over `parsing_processes` parallel workers (set in the scheduler/DAG-processor section of the Airflow config), but each worker parses its assigned files one after another, so a single slow file delays every file queued behind it on the same worker.
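A back-of-the-envelope check of that budget (hypothetical helper and numbers — real file counts and parse times come from your deployment):

```python
import math

def parse_cycle_seconds(n_files: int, avg_parse_s: float, workers: int) -> float:
    """Approximate wall-clock time for one full parse sweep: each worker
    handles roughly n_files / workers files sequentially."""
    return math.ceil(n_files / workers) * avg_parse_s

MIN_FILE_PROCESS_INTERVAL = 30  # seconds (default)

# 300 DAG files averaging 0.5 s each on 2 parsing workers:
cycle = parse_cycle_seconds(n_files=300, avg_parse_s=0.5, workers=2)
print(cycle, cycle <= MIN_FILE_PROCESS_INTERVAL)  # 75.0 False -> scheduling lag
```

Either the per-file parse time or the worker count has to change before the sweep fits inside the interval; cutting top-level code attacks the first factor.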
Evidence from `airflow-core/docs/best-practices.rst:97-179`:
```python
# BAD - runs on every parse cycle (every 30 seconds!)
import pandas as pd  # Heavy import at module level
data = pd.read_csv("/path/to/large/file.csv")  # File I/O at top level
```

```python
# GOOD - runs only when the task executes
def process_data():
    import pandas as pd  # Lazy import inside the callable
    data = pd.read_csv("/path/to/large/file.csv")
    # ... process data
```
A DAG file with a 1000-second top-level operation causes:
- An extra 1000 seconds added to every parse cycle for that file
- Blocks one of `parsing_processes` workers for that entire duration
- Other DAGs queued behind it in the same worker are delayed
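One way to catch regressions before they reach the scheduler is a CI guard that imports each DAG module and checks which heavy libraries it dragged in. A sketch using only the standard library (the helper name and module paths are hypothetical):

```python
import importlib
import sys

# Libraries that should never load at DAG parse time.
HEAVY = {"pandas", "tensorflow", "torch"}

def heavy_imports_after(module_name: str) -> set:
    """Import a module and report which heavy libraries it pulled in."""
    before = set(sys.modules)
    importlib.import_module(module_name)
    return HEAVY & (set(sys.modules) - before)

# In a CI test you might assert: not heavy_imports_after("dags.my_dag")
print(heavy_imports_after("json"))  # a lightweight module pulls in none of them
```

Wiring this into the test suite makes "no heavy top-level imports" an enforced invariant rather than a convention.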