Principle:Astronomer Astronomer cosmos Virtualenv Operator Execution

From Leeroopedia


Knowledge Sources
Domains Execution, Isolation
Last Updated 2026-02-07 17:00 GMT

Overview

Executing data transformation commands inside isolated Python virtual environments to prevent dependency conflicts between the orchestrator and the transformation tool.

Description

What it is: Virtualenv Operator Execution is a principle in which each dbt command runs inside a dedicated Python virtual environment rather than sharing the orchestrator's (Airflow's) global Python installation. A base operator class (DbtVirtualenvBaseOperator) encapsulates the lifecycle of creating, configuring, executing within, and cleaning up that virtual environment for every dbt invocation.

What problem it solves: Apache Airflow and dbt Core are both Python applications, but they frequently require mutually incompatible versions of shared transitive dependencies (e.g., different versions of Jinja2, MarkupSafe, or database adapter libraries). Installing both into the same environment leads to version conflicts, broken imports, and silent behavioural regressions. Virtualenv execution avoids this class of failure by running dbt in a purpose-built environment with its own pinned dependency tree.

Where it fits: This principle sits within Cosmos's Execution layer. It is one of several execution strategies (alongside Local, Docker, ECS, Kubernetes, and Airflow Async). The user selects it by setting execution_mode = ExecutionMode.VIRTUALENV when constructing a Cosmos DAG or task group. It reuses Airflow's built-in prepare_virtualenv utility and produces concrete operator subclasses for every dbt command (run, build, test, seed, snapshot, source, ls, docs, run-operation, clone).
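As an illustration of how this mode might be selected, here is a minimal sketch of a DAG definition. The paths, dag_id, profile names, and the dbt-postgres requirement are hypothetical placeholders. The sketch also assumes a Cosmos version whose ExecutionConfig accepts virtualenv_dir. The Cosmos imports are deferred into the function only so that the module can be loaded where Cosmos is not installed; in a real DAG file they would normally sit at the top.

```python
from datetime import datetime
from pathlib import Path

# Hypothetical locations; adjust to your deployment.
DBT_PROJECT_DIR = Path("/usr/local/airflow/dbt/jaffle_shop")
PERSISTENT_VENV_DIR = Path("/usr/local/airflow/venvs/dbt-postgres")

# Arguments forwarded to every dbt operator that Cosmos generates.
OPERATOR_ARGS = {
    "py_requirements": ["dbt-postgres"],  # installed inside the virtualenv
    "py_system_site_packages": False,     # hide Airflow's packages from dbt
}


def build_dag():
    """Build a DbtDag whose dbt commands run inside a virtualenv."""
    # Deferred import: requires astronomer-cosmos and Airflow at call time.
    from cosmos import (
        DbtDag,
        ExecutionConfig,
        ExecutionMode,
        ProfileConfig,
        ProjectConfig,
    )

    return DbtDag(
        dag_id="jaffle_shop_virtualenv",
        start_date=datetime(2024, 1, 1),
        schedule=None,
        project_config=ProjectConfig(DBT_PROJECT_DIR),
        profile_config=ProfileConfig(
            profile_name="default",
            target_name="dev",
            profiles_yml_filepath=DBT_PROJECT_DIR / "profiles.yml",
        ),
        execution_config=ExecutionConfig(
            execution_mode=ExecutionMode.VIRTUALENV,
            # Omit virtualenv_dir to get a fresh temporary environment per task.
            virtualenv_dir=PERSISTENT_VENV_DIR,
        ),
        operator_args=OPERATOR_ARGS,
    )
```

In a DAG file the result would be assigned at module level (for example, dag = build_dag()) so that Airflow can discover it. Two DAGs with different py_requirements pins and different virtualenv_dir values could then run different dbt versions on the same cluster.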

Usage

Use Virtualenv Operator Execution when:

  • The Airflow environment and the required dbt version have conflicting Python dependencies.
  • You need to test or run multiple dbt versions across different DAGs on the same Airflow cluster.
  • Full container isolation (Docker, ECS, Kubernetes) is not justified by the workload but you still need dependency safety.
  • You want the simplicity of local file-system access to dbt project files without the overhead of building and pushing container images.

Avoid this principle in two cases. First, when latency-sensitive scheduling cannot absorb the cost of creating a virtual environment on every task execution; persisting the virtualenv directory mitigates this. Second, when the dbt adapter has non-Python system-level dependencies that pip install alone cannot satisfy.

Theoretical Basis

The core mechanism rests on four pillars:

1. Virtual environment lifecycle management. Before every dbt command, the operator calls Airflow's prepare_virtualenv function, which creates (or reuses) a venv directory, installs the declared py_requirements with optional pip_install_options, and returns the path to the resulting Python binary. The operator then rewrites the dbt command to use that binary's sibling dbt entry point, ensuring all imports resolve against the isolated site-packages.
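The lifecycle in pillar 1 can be approximated with the standard library alone. The sketch below is not Cosmos's code and does not call prepare_virtualenv; it swaps in Python's venv module to create a bare environment and then derives where a pip-installed dbt console script would sit next to the interpreter. The function names create_venv and sibling_dbt are hypothetical, and no packages are installed, so the dbt path is computed but not expected to exist.

```python
import os
import tempfile
import venv
from pathlib import Path


def create_venv(venv_dir: Path) -> Path:
    """Create a bare virtualenv and return the path to its Python binary."""
    # with_pip=False keeps the sketch fast and offline; a real setup would
    # install the declared requirements (e.g. a dbt adapter) at this point.
    builder = venv.EnvBuilder(with_pip=False, symlinks=(os.name != "nt"))
    builder.create(venv_dir)
    bin_dir = venv_dir / ("Scripts" if os.name == "nt" else "bin")
    return bin_dir / ("python.exe" if os.name == "nt" else "python")


def sibling_dbt(py_bin: Path) -> Path:
    """Return the location where the venv's dbt entry point would live."""
    # Console scripts are installed into the same directory as the interpreter.
    return py_bin.parent / "dbt"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        py_bin = create_venv(Path(tmp) / "venv")
        print("interpreter:", py_bin)
        print("dbt entry point would be:", sibling_dbt(py_bin))
```

Because the entry point is resolved relative to the interpreter, any command launched through it imports from that environment's site-packages and not from the orchestrator's.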

2. Lock-based shared virtualenv access. When a persistent virtualenv_dir is configured and shared across parallel tasks, concurrent creation or mutation of the environment would corrupt it. Cosmos addresses this with a PID-based lock file (cosmos_virtualenv.lock). An operator acquires the lock before preparing the virtualenv, holds it during installation, and releases it before command execution. If the lock is held by another process, the operator polls at one-second intervals up to a configurable limit (virtualenv_max_retries_lock, default 120, i.e. roughly 120 seconds of waiting). Stale locks left by crashed processes are detected by checking the recorded PID with psutil.Process.is_running().
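A PID-based lock of this kind can be sketched as follows. This is an illustrative approximation, not the Cosmos implementation: the function names are hypothetical, the liveness check uses os.kill(pid, 0) as a POSIX-only stand-in for psutil, and raising TimeoutError when retries run out is a choice made for this sketch. The check-then-write step is also not atomic, so the sketch narrows the race window without closing it.

```python
import os
import time
from pathlib import Path

LOCK_NAME = "cosmos_virtualenv.lock"  # file name taken from the text above


def pid_is_running(pid: int) -> bool:
    """POSIX-only liveness check; a stand-in for psutil's is_running()."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 probes for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_is_available(lock_file: Path) -> bool:
    """The lock is free if the file is absent, unparsable, or stale."""
    try:
        pid = int(lock_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return True
    return not pid_is_running(pid)


def acquire_lock(venv_dir: Path, max_retries: int = 120, interval: float = 1.0) -> Path:
    """Poll until the lock is free, then claim it by writing our own PID."""
    lock_file = venv_dir / LOCK_NAME
    retries = max_retries
    while not lock_is_available(lock_file) and retries > 0:
        time.sleep(interval)
        retries -= 1
    if not lock_is_available(lock_file):
        raise TimeoutError(f"lock {lock_file} still held after {max_retries} retries")
    venv_dir.mkdir(parents=True, exist_ok=True)
    lock_file.write_text(str(os.getpid()))
    return lock_file


def release_lock(lock_file: Path) -> None:
    """Remove the lock file, but only if this process still owns it."""
    try:
        owner = int(lock_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return
    if owner == os.getpid():
        lock_file.unlink()
```

A caller would wrap the environment preparation between acquire_lock and release_lock, ideally in a try/finally block, and run the dbt command only after the lock has been released.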

3. Temporary versus persistent environments. Two modes are supported. In temporary mode (the default when no virtualenv_dir is supplied, or when is_virtualenv_dir_temporary=True), a TemporaryDirectory is created, used for a single execution, and deleted on completion or kill. In persistent mode, the virtualenv directory survives across task runs, amortising the cost of pip install across many invocations while the lock protocol ensures safe concurrent access.
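The difference between the two modes can be expressed as a small context manager. This is a sketch under stated assumptions: virtualenv_location is a hypothetical helper, not a Cosmos function, and it models only the choice between a throwaway directory and a caller-supplied one.

```python
from contextlib import contextmanager
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Iterator, Optional


@contextmanager
def virtualenv_location(virtualenv_dir: Optional[Path] = None) -> Iterator[Path]:
    """Yield the directory in which the virtualenv should be built."""
    if virtualenv_dir is None:
        # Temporary mode: the directory is deleted when the block exits,
        # including when the block exits through an exception.
        with TemporaryDirectory(prefix="cosmos-venv-") as tmp:
            yield Path(tmp)
    else:
        # Persistent mode: the directory is created once and left in place,
        # so later runs can reuse the packages already installed there.
        virtualenv_dir.mkdir(parents=True, exist_ok=True)
        yield virtualenv_dir
```

Temporary mode pays the installation cost on every run but needs no locking, since no other task can see the directory. Persistent mode trades that cost for the need to coordinate concurrent access.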

4. Subprocess invocation mode. Regardless of the user's global invocation mode preference, the virtualenv operator forces InvocationMode.SUBPROCESS. This is essential because the dbt process must run under the isolated Python binary rather than being imported into the Airflow worker's own interpreter. The operator overrides the command's first element to point at the virtualenv's dbt executable.
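The command rewrite and subprocess launch can be sketched like this. The helper names are hypothetical, and the demonstration substitutes the current Python interpreter for the virtualenv's dbt executable so that it runs on a machine without dbt installed.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def rewrite_command(cmd: List[str], executable: Path) -> List[str]:
    """Return a copy of cmd whose first element points at executable."""
    if not cmd:
        raise ValueError("command must not be empty")
    return [str(executable), *cmd[1:]]


def run_in_subprocess(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run cmd in a child process, separate from the caller's interpreter."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    # Stand-in: sys.executable plays the role of <venv>/bin/dbt here.
    rewritten = rewrite_command(["dbt", "--version"], Path(sys.executable))
    result = run_in_subprocess(rewritten)
    print(rewritten)
    print(result.returncode, result.stdout.strip())
```

Importing dbt as a library would load it into the worker's own interpreter and therefore against the worker's site-packages, which is why a child process started from the virtualenv's executable is needed for the isolation to hold.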
