Principle:Astronomer Astronomer cosmos Dbt Documentation Generation
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Doc (dbt Docs), Repo (astronomer-cosmos) |
| Domains | Data_Engineering, Documentation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
A principle for generating dbt project documentation artifacts (catalog, manifest, index) as part of an orchestrated pipeline.
Description
dbt provides a dbt docs generate command that introspects the data warehouse schema and generates documentation artifacts. These artifacts form a self-contained documentation site:
- index.html: A browsable single-page application that renders the documentation UI. This is the entry point for viewing docs in a browser.
- manifest.json: A comprehensive metadata file describing the entire dbt project, including models, sources, tests, exposures, macros, and their relationships. It captures the project DAG, column descriptions, and configuration details.
- catalog.json: Schema information obtained by querying the data warehouse's metadata (typically its information schema). It contains column names, data types, and other database-level metadata for each relation materialized by the dbt project, plus statistics such as row counts where the warehouse supports them.
When integrated into an orchestration pipeline such as Apache Airflow, the documentation generation step typically precedes an upload step that publishes the artifacts for hosting. Regenerating on each pipeline run keeps the documentation current as models evolve. The generation process connects to the target database using the configured dbt profile to build the catalog, while the manifest is derived from the project files themselves.
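The generation step described above can be sketched as a small Python wrapper around the dbt CLI. This is a minimal illustration, not how Cosmos implements it: the function names (build_docs_command, missing_artifacts, generate_docs) are hypothetical, and it assumes dbt is installed on the worker and that artifacts land in the default target/ directory.

```python
import subprocess
from pathlib import Path

# The three artifacts named in this section.
EXPECTED_ARTIFACTS = ("index.html", "manifest.json", "catalog.json")


def build_docs_command(project_dir, profiles_dir):
    """Return the argument list for `dbt docs generate` (hypothetical helper)."""
    return [
        "dbt", "docs", "generate",
        "--project-dir", str(project_dir),
        "--profiles-dir", str(profiles_dir),
    ]


def missing_artifacts(target_dir):
    """List expected artifact names that are absent from target_dir."""
    target = Path(target_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (target / name).is_file()]


def generate_docs(project_dir, profiles_dir):
    """Run dbt and fail loudly if any artifact was not produced.

    Assumes the default target/ location inside the project directory.
    """
    subprocess.run(build_docs_command(project_dir, profiles_dir), check=True)
    target = Path(project_dir) / "target"
    missing = missing_artifacts(target)
    if missing:
        raise FileNotFoundError(f"dbt did not produce: {', '.join(missing)}")
    return target
```

Checking for the artifacts right after the command returns lets a downstream upload task fail fast instead of publishing an incomplete site.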
Static Documentation
dbt also supports a --static flag that bundles all artifacts into a single static_index.html file. This self-contained file includes the manifest and catalog data inline, eliminating the need to serve multiple files. This is useful for simplified hosting scenarios.
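As a rough sketch of how the choice between the two modes might affect a pipeline, the helpers below build the command and pick the files to publish. The helper names are hypothetical, and the sketch assumes a dbt version that supports --static.

```python
def build_docs_command(project_dir, static=False):
    """Return the `dbt docs generate` argument list, optionally with --static."""
    command = ["dbt", "docs", "generate", "--project-dir", str(project_dir)]
    if static:
        command.append("--static")
    return command


def files_to_host(static=False):
    """Return the artifact names a hosting step would need to publish."""
    if static:
        # The single bundled file carries manifest and catalog data inline.
        return ["static_index.html"]
    return ["index.html", "manifest.json", "catalog.json"]
```

With static=True the upload step only has to move one file, which is the simplification this mode is meant to provide.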
Artifact Lifecycle
The generated artifacts are written to the dbt project's target/ directory by default. In an orchestration context, this directory is typically a temporary workspace on the Airflow worker. The artifacts must be explicitly transferred to a durable storage location (cloud object storage, a shared filesystem, or an artifact repository) to be accessible after the task completes.
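The transfer step can be illustrated with a local copy, where a destination directory stands in for the durable location. This is a sketch under that assumption: persist_artifacts is a hypothetical helper, and a real pipeline would replace the copy with an upload to object storage or a similar service.

```python
import shutil
from pathlib import Path


def persist_artifacts(target_dir, destination_dir,
                      names=("index.html", "manifest.json", "catalog.json")):
    """Copy docs artifacts out of the temporary target/ directory.

    destination_dir is a stand-in for durable storage (assumption).
    """
    target = Path(target_dir)
    destination = Path(destination_dir)

    # Validate everything first so a partial set is never published.
    missing = [name for name in names if not (target / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing artifacts: {', '.join(missing)}")

    destination.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(target / name, destination / name)) for name in names]
```

Validating before copying avoids leaving a half-updated documentation site when one artifact was not generated.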
Usage
Use dbt documentation generation when:
- Data models evolve frequently: If the schema changes with each pipeline run (new columns, renamed models, updated descriptions), regenerating docs ensures documentation stays synchronized with the actual data warehouse state.
- Scheduled documentation pipelines: Documentation should be regenerated automatically as part of a nightly or per-deployment pipeline, so the team always has access to up-to-date docs without manual intervention.
- Compliance and audit requirements: Data governance or audit processes require current documentation of the data warehouse schema.
- Onboarding and self-service analytics: Analysts and data consumers need a browsable catalog of available data assets, their descriptions, column types, and lineage.
Theoretical Basis
The dbt docs generate command operates in two phases:
- Manifest construction: dbt parses the project files (models, sources, seeds, snapshots, tests, macros) and builds a directed acyclic graph (DAG) representing the project. This includes all user-provided documentation strings (descriptions in YAML files and doc blocks), column-level metadata, and configuration. Parsing the project files does not itself require database access, although by default the command also compiles the project (as dbt compile does), which can issue introspective queries; the --no-compile flag skips that step.
- Catalog construction: dbt connects to the target database using the active profile and runs introspection queries against the information schema. For each relation materialized by the project, it retrieves column names, data types, and (where supported) statistics such as row counts and byte sizes. The catalog provides the ground-truth schema information that complements the project-level metadata in the manifest.
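To make the DAG captured during manifest construction more concrete, the sketch below walks upstream dependencies of a node. It assumes the manifest exposes a parent_map keyed by node unique_id; the function name and the node identifiers in the usage note are hypothetical.

```python
from collections import deque


def upstream_nodes(manifest, unique_id):
    """Return every ancestor of unique_id, using the manifest's parent_map."""
    parent_map = manifest.get("parent_map", {})
    seen = set()
    queue = deque(parent_map.get(unique_id, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a node can be reachable through several paths
        seen.add(current)
        queue.extend(parent_map.get(current, []))
    return sorted(seen)
```

For a manifest loaded with json.load, a call such as upstream_nodes(manifest, "model.shop.orders") would list the models and sources that feed that model.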
The resulting artifacts are designed to be consumed together by the index.html application, which renders an interactive documentation site with search, lineage visualization, and per-model detail pages.
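How the two artifacts complement each other can be sketched by joining them on a node's unique_id: descriptions come from the manifest, physical types from the catalog. The sketch assumes both files keep per-node entries under a top-level "nodes" key with a "columns" mapping; describe_columns is a hypothetical helper, and the case-insensitive match is an assumption to cope with warehouses that upper-case column names.

```python
def describe_columns(manifest, catalog, unique_id):
    """Combine catalog column types with manifest column descriptions."""
    documented = manifest.get("nodes", {}).get(unique_id, {}).get("columns", {})
    # Index descriptions by lower-cased name for case-insensitive matching.
    descriptions = {
        name.lower(): column.get("description", "")
        for name, column in documented.items()
    }

    physical = catalog.get("nodes", {}).get(unique_id, {}).get("columns", {})
    ordered = sorted(physical.items(), key=lambda item: item[1].get("index", 0))

    return [
        {
            "name": name,
            "type": column.get("type"),
            "description": descriptions.get(name.lower(), ""),
        }
        for name, column in ordered
    ]
```

Columns that exist in the warehouse but have no description in the project come back with an empty description, which is one way to spot documentation gaps.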