# Principle: Astronomer Cosmos Cloud Documentation Upload
## Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Doc (Cosmos Generating Docs), Repo (astronomer-cosmos) |
| Domains | Data_Engineering, Documentation, Cloud_Storage |
| Last Updated | 2026-02-07 14:00 GMT |
## Overview
A principle for uploading generated dbt documentation artifacts to cloud object storage for persistent hosting.
## Description
After generating dbt documentation, the artifacts must be stored in a durable, accessible location. Cloud object storage (S3, GCS, Azure Blob) provides a standard hosting mechanism. The upload step transfers `index.html`, `manifest.json`, and `catalog.json` to a specified bucket/container, where they can be served via a plugin or direct URL access.
## The Generate-Then-Upload Pattern
The documentation workflow follows a two-phase pattern:
- Generate phase: The `dbt docs generate` command runs on the Airflow worker, connecting to the data warehouse to introspect schemas and produce the documentation artifacts in a local target directory.
- Upload phase: The generated files are transferred from the worker's local filesystem to a cloud storage bucket. This ensures the documentation persists beyond the lifecycle of the Airflow task and is accessible to the broader team.
This separation of concerns allows each phase to be configured independently. The generation step requires database credentials, while the upload step requires cloud storage credentials. Different cloud providers can be targeted without modifying the generation logic.
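The two-phase pattern above can be sketched as a small driver function. This is illustrative only: `generate` and `upload` are hypothetical stand-ins injected as callables (not Cosmos or Airflow APIs), which is what keeps the generation logic independent of the storage provider.

```python
from pathlib import Path
from typing import Callable

# Artifacts dbt writes into its local target directory (per the document).
DOC_ARTIFACTS = ("index.html", "manifest.json", "catalog.json")

def generate_then_upload(
    target_dir: Path,
    generate: Callable[[], None],
    upload: Callable[[Path], None],
) -> list:
    """Run the generate phase, then upload each artifact found locally.

    `generate` stands in for `dbt docs generate` (needs warehouse
    credentials); `upload` stands in for the provider-specific transfer
    (needs cloud storage credentials). Neither knows about the other.
    """
    generate()  # phase 1: produce docs into target_dir
    uploaded = []
    for name in DOC_ARTIFACTS:
        local = target_dir / name
        if local.exists():
            upload(local)  # phase 2: transfer to cloud storage
            uploaded.append(name)
    return uploaded
```

Because the two phases only meet at the `target_dir` boundary, swapping the upload target (S3, GCS, Azure Blob) never touches the generation step.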
## Supported Cloud Providers
The upload mechanism supports three major cloud storage providers:
- Amazon S3: Files are uploaded to an S3 bucket using the AWS SDK (boto3) via an Airflow AWS connection. The S3Hook handles authentication, region configuration, and multipart uploads.
- Google Cloud Storage (GCS): Files are uploaded to a GCS bucket using the Google Cloud SDK via an Airflow Google Cloud connection. The GCSHook manages service account authentication and project configuration.
- Azure Blob Storage: Files are uploaded to an Azure Blob container using the Azure SDK via an Airflow WASB connection. The WasbHook handles storage account authentication and container management.
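The provider list above follows a uniform shape: each storage backend pairs an Airflow hook with a connection type and an underlying SDK. A minimal lookup table makes that abstraction explicit (the hook class names are the ones named in the document; the table itself is an illustrative sketch, not a Cosmos data structure):

```python
# One entry per supported provider, keyed by storage URI scheme.
PROVIDER_UPLOADERS = {
    "s3":   {"hook": "S3Hook",   "connection": "AWS",          "sdk": "boto3"},
    "gcs":  {"hook": "GCSHook",  "connection": "Google Cloud", "sdk": "google-cloud-storage"},
    "wasb": {"hook": "WasbHook", "connection": "WASB",         "sdk": "azure-storage-blob"},
}

def uploader_for(scheme: str) -> str:
    """Return the Airflow hook class name for a storage URI scheme."""
    try:
        return PROVIDER_UPLOADERS[scheme]["hook"]
    except KeyError:
        raise ValueError(f"unsupported storage scheme: {scheme}")
```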
## Folder Organization
An optional folder directory parameter allows organizing documentation within the bucket. This enables:
- Versioned documentation (e.g., `docs/v1/`, `docs/v2/`)
- Multi-project documentation (e.g., `project_a/`, `project_b/`)
- Environment-specific docs (e.g., `dev/`, `prod/`)
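The folder parameter boils down to prefixing each object key inside the bucket. A sketch of that key construction, assuming a hypothetical `folder_dir`-style argument (the name is illustrative, not necessarily the library's parameter):

```python
from __future__ import annotations

def object_key(folder_dir: str | None, filename: str) -> str:
    """Build a storage object key, honoring an optional folder prefix.

    `folder_dir` is the organizing prefix described above, e.g.
    "docs/v1" or "prod"; stray slashes are normalized away.
    """
    if folder_dir:
        return f"{folder_dir.strip('/')}/{filename}"
    return filename
```

With no prefix the artifacts land at the bucket root; with a prefix, each environment or version gets its own namespace inside the same bucket.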
## Usage
Use cloud documentation upload when:
- Team-wide access is required: dbt documentation needs to be accessible beyond the Airflow worker's local filesystem, typically via a shared web interface.
- Persistent hosting: Documentation must survive Airflow worker restarts, task retries, and ephemeral compute environments (e.g., Kubernetes pods).
- Integration with Airflow UI: The Cosmos plugin reads documentation from cloud storage to serve it through the Airflow web interface, requiring the artifacts to be in a cloud bucket.
- Multi-environment documentation: Different Airflow environments (dev, staging, production) each generate and upload their own documentation to separate storage paths.
## Theoretical Basis
The upload follows a generate-then-store pattern that is common in CI/CD and data pipeline architectures. Cloud storage provides durable, scalable, and cost-effective hosting for static documentation files. Each cloud provider has its own hook/SDK for authenticated file transfer, but the abstract pattern remains the same:
- Enumerate the required documentation files in the local target directory.
- For each file, construct the target cloud storage path using the bucket name and optional folder prefix.
- Upload the file using the provider-specific SDK, authenticated via the Airflow connection.
- Verify that the upload completed successfully.
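The four abstract steps above can be expressed as one provider-agnostic loop. In this sketch, `put_object` and `object_exists` are hypothetical stand-ins for the provider SDK calls (the real hooks differ per provider, as described earlier):

```python
from pathlib import Path
from typing import Callable

REQUIRED_FILES = ("index.html", "manifest.json", "catalog.json")

def upload_docs(
    target_dir: Path,
    bucket: str,
    folder: str,
    put_object: Callable[[Path, str], None],
    object_exists: Callable[[str], bool],
) -> None:
    """Enumerate artifacts, build target paths, upload, and verify."""
    for name in REQUIRED_FILES:                       # 1. enumerate files
        local = target_dir / name
        if not local.exists():
            raise FileNotFoundError(local)
        key = f"{folder.strip('/')}/{name}" if folder else name  # 2. target path
        dest = f"{bucket}/{key}"
        put_object(local, dest)                       # 3. provider-specific upload
        if not object_exists(dest):                   # 4. verify success
            raise RuntimeError(f"upload verification failed for {dest}")
```

Only the two injected callables change when the cloud provider changes; the loop itself is the invariant pattern.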
This pattern leverages cloud storage's built-in features:
- Durability: Major cloud storage services advertise 99.999999999% (11 nines) durability for stored objects.
- Availability: Objects are accessible via HTTP/HTTPS endpoints.
- Scalability: No capacity planning is required; storage scales automatically.
- Cost efficiency: Static documentation files are small (typically under 10 MB) and infrequently accessed, resulting in minimal storage costs.