
Principle:Astronomer cosmos Cloud Storage Connection Setup

From Leeroopedia


Metadata

Field Value
Page Type Principle
Knowledge Sources Doc (Airflow Connections), Repo (astronomer-cosmos)
Domains Configuration, Cloud_Storage, Security
Last Updated 2026-02-07 14:00 GMT

Overview

A configuration principle for establishing authenticated connections to cloud storage services within an orchestration system.

Description

Before dbt documentation can be uploaded to or served from cloud storage, the orchestration system must have authenticated access to the storage service. This involves configuring provider-specific credentials (AWS access keys, GCP service accounts, Azure storage keys) as Airflow connections that operators and hooks can reference by connection ID.

Connection as Abstraction

Airflow connections provide a uniform abstraction over heterogeneous authentication mechanisms. A connection encapsulates:

  • Connection type: Identifies the provider and determines which hook class processes the connection (e.g., aws for S3, google_cloud_platform for GCS, wasb for Azure Blob Storage).
  • Connection ID: A user-defined string that operators reference to obtain credentials at runtime. This decouples operator configuration from credential management.
  • Authentication parameters: Provider-specific fields such as access keys, service account JSON, storage account names, and SAS tokens.

This abstraction allows the same operator code to work across environments (development, staging, production) by simply changing the connection ID or the credentials behind it.
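This decoupling can be illustrated with a small self-contained sketch. This is plain Python, not the Airflow API: `Connection`, `STORES`, `upload_docs`, and all credential values are hypothetical stand-ins for the metadata database and an operator's call site.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Simplified stand-in for an Airflow connection record."""
    conn_type: str                # e.g. "aws", "google_cloud_platform", "wasb"
    login: str = ""               # provider-specific, e.g. an access key ID
    password: str = ""            # provider-specific, e.g. a secret key
    extra: dict = field(default_factory=dict)


# Per-environment stores: the same connection ID maps to different credentials.
STORES = {
    "dev":  {"storage_default": Connection("aws", "AKIA_DEV", "dev-secret")},
    "prod": {"storage_default": Connection("aws", "AKIA_PROD", "prod-secret")},
}


def upload_docs(conn_id: str, env: str) -> str:
    """Operator-style code: references a conn ID, never raw credentials."""
    conn = STORES[env][conn_id]
    return f"authenticated as {conn.login} via {conn.conn_type}"


# The identical call site works in every environment.
print(upload_docs("storage_default", "dev"))   # authenticated as AKIA_DEV via aws
print(upload_docs("storage_default", "prod"))  # authenticated as AKIA_PROD via aws
```

The operator body never changes between environments; only the store behind the connection ID does.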

Provider-Specific Connections

Each cloud provider requires different authentication parameters:

  • Amazon Web Services (S3):
    • Connection type: aws or Amazon Web Services
    • Key fields: aws_access_key_id, aws_secret_access_key, region_name
    • Alternative: IAM role-based authentication via instance profiles or EKS service accounts (no explicit keys needed)
    • Default connection ID: aws_default
  • Google Cloud Platform (GCS):
    • Connection type: google_cloud_platform
    • Key fields: keyfile_dict (service account JSON) or keyfile_path (path to service account JSON file)
    • Alternative: Workload Identity on GKE (no explicit keys needed)
    • Default connection ID: google_cloud_default
  • Microsoft Azure (Blob Storage):
    • Connection type: wasb or Azure Blob Storage
    • Key fields: login (storage account name), password (storage account key or SAS token)
    • Alternative: Managed Identity authentication
    • Default connection ID: wasb_default

Credential Storage

Airflow supports multiple credential storage backends:

  • Metadata database: Connections stored directly in the Airflow database (default). Credentials can be encrypted at rest using a Fernet key.
  • Environment variables: Connections defined as environment variables in the format AIRFLOW_CONN_{CONN_ID}.
  • External secrets backends: Integration with HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, and other secret management systems.
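For the environment-variable backend, a connection is encoded as a URI under `AIRFLOW_CONN_{CONN_ID}`. A minimal sketch, assuming the connection ID `my_s3_conn`; the key, secret, and region are placeholders:

```python
import os

# An S3 connection defined as an environment variable. The URI scheme maps to
# the connection type; Airflow parses the URI when the connection is looked up.
os.environ["AIRFLOW_CONN_MY_S3_CONN"] = (
    "aws://AKIAEXAMPLE:example-secret@/?region_name=us-east-1"
)
```

Connections defined this way never touch the metadata database, which suits containerized deployments where secrets are injected at runtime.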

Usage

Use cloud storage connection setup when:

  • Setting up dbt docs hosting: Before configuring DbtDocsCloudLocalOperator subclasses or the Cosmos plugin, the corresponding cloud storage connection must exist.
  • Initial Airflow deployment: As part of the infrastructure-as-code setup for a new Airflow environment, connections to cloud services are provisioned alongside DAG deployments.
  • Credential rotation: When cloud provider access keys are rotated, the Airflow connection must be updated to maintain access.
  • Multi-cloud setups: When documentation is uploaded to one provider but served from another, multiple connections may need to be configured.
  • General authenticated access: any use of S3, GCS, or Azure Blob Storage from Airflow operators or hooks requires a properly configured connection.

Theoretical Basis

Airflow's connection management provides a centralized credential store. Connections are typed by provider (aws, google_cloud_platform, wasb) and store the authentication parameters needed by the corresponding provider hooks.

Hook Resolution

When an operator needs to interact with a cloud service, it:

  1. Instantiates the appropriate hook class (e.g., S3Hook, GCSHook, WasbHook).
  2. Passes the connection ID to the hook constructor.
  3. The hook retrieves the connection details from Airflow's connection store.
  4. The hook uses the stored credentials to authenticate with the cloud provider's SDK.
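The four steps above can be sketched in plain Python. `ConnectionStore` and `S3HookStub` are stubs standing in for Airflow's connection store and the provider's `S3Hook`, and the credentials are placeholders:

```python
class ConnectionStore:
    """Stand-in for Airflow's metadata database or secrets backend."""
    _conns = {"aws_default": {"login": "AKIAEXAMPLE", "password": "example-secret"}}

    @classmethod
    def get(cls, conn_id: str) -> dict:
        return cls._conns[conn_id]


class S3HookStub:
    # Steps 1-2: the hook is constructed with a connection ID, never raw keys.
    def __init__(self, aws_conn_id: str = "aws_default"):
        self.aws_conn_id = aws_conn_id

    def load_file(self, filename: str, key: str, bucket_name: str) -> str:
        # Step 3: resolve the connection from the store at call time.
        creds = ConnectionStore.get(self.aws_conn_id)
        # Step 4: a real hook would now hand creds to the provider SDK (boto3).
        return f"uploaded {filename} to s3://{bucket_name}/{key} as {creds['login']}"


msg = S3HookStub().load_file("index.html", "docs/index.html", "my-docs-bucket")
print(msg)
```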

This indirection allows:

  • Credential isolation: Operators never handle raw credentials; they only reference connection IDs.
  • Centralized management: All credentials are managed in a single location (the Airflow metadata database or external secrets backend).
  • Environment portability: The same DAG code works across environments by maintaining connections with the same IDs but different credentials.

Security Considerations

  • Principle of least privilege: Cloud storage credentials should grant only the minimum permissions required (e.g., s3:PutObject and s3:GetObject for the specific bucket, not full S3 access).
  • Encryption at rest: Airflow encrypts connection passwords using a Fernet key when stored in the metadata database. The Fernet key itself should be stored securely.
  • Credential rotation: Regularly rotating access keys and updating the Airflow connection reduces the blast radius of credential compromise.
  • External secrets backends: For production deployments, using an external secrets backend (Vault, AWS Secrets Manager) provides audit logging, automatic rotation, and centralized secret management.
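As an illustration of least privilege, a policy granting only the object read/write actions named above on a single docs bucket might look like the following. The bucket name is a placeholder; this is a sketch, not a vetted policy:

```python
import json

# Least-privilege policy scoped to one docs bucket (bucket name is hypothetical).
DOCS_BUCKET_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-dbt-docs-bucket/*",
        }
    ],
}

print(json.dumps(DOCS_BUCKET_POLICY, indent=2))
```

Granting bucket-scoped `GetObject`/`PutObject` instead of full S3 access limits what a leaked credential can do.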
