Workflow:Datahub project Datahub Metadata Ingestion Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Metadata_Management, ETL |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
End-to-end process for extracting metadata from external data sources and ingesting it into a DataHub instance using the Python-based ingestion framework.
Description
This workflow covers the standard procedure for pulling metadata from data platforms (databases, BI tools, data warehouses, etc.) and loading it into DataHub. It is driven by a YAML recipe that defines a source connector, optional transformers, and a sink destination. The framework supports over 50 source connectors and can be run from the CLI, through the programmatic Python API, or from the ingestion tab in the DataHub UI.
Usage
Execute this workflow when you need to catalog metadata from an external data platform in DataHub. Typical triggers include onboarding a new data source, scheduling a recurring metadata refresh, or performing a one-time metadata import. You need access credentials for the source system and a running DataHub instance.
Execution Steps
Step 1: Install the CLI and Source Connector
Install the DataHub CLI Python package (`acryl-datahub` on PyPI) together with the connector plugin for your data source. Each source type has a dedicated plugin that pulls in the drivers and dependencies it needs.
Key considerations:
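- Requires Python 3.8+ (recent CLI releases may raise this minimum, so check the requirements for the version you install)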
- Each source connector is installed as a pip extra of the CLI package (e.g., bigquery, mysql, looker, as in `acryl-datahub[mysql]`)
- Virtual environments are recommended to isolate dependencies
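The install step can be scripted. The following is a minimal sketch, assuming the CLI is distributed on PyPI as `acryl-datahub` and that connectors are selected through pip extras; the helper names `requirement_for` and `pip_install_command` are hypothetical and only build the command rather than running it.

```python
import sys

# Assumption: "acryl-datahub" is the PyPI distribution that provides the `datahub` CLI.
PACKAGE = "acryl-datahub"


def requirement_for(connectors):
    """Build a pip requirement string such as 'acryl-datahub[bigquery,mysql]'."""
    # Normalise and de-duplicate the extras so the output is stable.
    extras = sorted({name.strip().lower() for name in connectors if name.strip()})
    return f"{PACKAGE}[{','.join(extras)}]" if extras else PACKAGE


def pip_install_command(connectors):
    """Return the argv that installs the CLI plus the requested connector extras."""
    # `sys.executable -m pip` targets the interpreter of the active virtual environment.
    return [sys.executable, "-m", "pip", "install", "--upgrade", requirement_for(connectors)]


if __name__ == "__main__":
    # Print the command instead of executing it, so it can be reviewed first.
    print(" ".join(pip_install_command(["mysql", "looker"])))
```

Passing the returned list to `subprocess.run(..., check=True)` from inside an activated virtual environment would perform the actual install.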
Step 2: Configure the Ingestion Recipe
Create a YAML recipe file with three sections: source configuration, optional transformers, and sink configuration. The source section specifies the connector type and connection credentials. If the sink section is omitted, the pipeline defaults to the DataHub REST sink.
Key considerations:
- Use environment variable substitution for secrets and credentials
- Each recipe pairs exactly one source with one sink
- The transformers section can add tags, ownership, or modify metadata in transit
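- The special `__DATAHUB_TO_FILE_` key prefix writes an inline value to a file and passes the file path to the connector, which helps with settings that expect a file, such as SSL certificates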
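The three recipe sections can also be expressed as a Python dictionary, which has the same shape as the YAML file. The sketch below assumes a MySQL source, the `simple_add_dataset_tags` transformer, and a `datahub-rest` sink; the environment variable names, the default user, and the tag URN are illustrative assumptions, not values from this workflow. In a YAML recipe, the equivalent of the secret lookup would be a `${MYSQL_PASSWORD}` placeholder.

```python
import os


def build_recipe(env=os.environ):
    """Return a recipe dict with source, transformers, and sink sections."""
    return {
        # Exactly one source per recipe.
        "source": {
            "type": "mysql",
            "config": {
                "host_port": env.get("MYSQL_HOST_PORT", "localhost:3306"),
                "username": env.get("MYSQL_USER", "datahub_reader"),  # hypothetical user
                # The secret is read from the environment and never hard-coded.
                "password": env["MYSQL_PASSWORD"],
            },
        },
        # Optional: modify metadata in transit, here by attaching a tag.
        "transformers": [
            {
                "type": "simple_add_dataset_tags",
                "config": {"tag_urns": ["urn:li:tag:Ingested"]},  # illustrative tag
            }
        ],
        # Exactly one sink per recipe.
        "sink": {
            "type": "datahub-rest",
            "config": {"server": env.get("DATAHUB_GMS_URL", "http://localhost:8080")},
        },
    }
```

Reading the password with `env["MYSQL_PASSWORD"]` rather than `.get()` makes a missing secret fail immediately with a `KeyError` instead of producing a recipe that fails later at connection time.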
Step 3: Validate the Recipe and Test Connection
Verify the recipe YAML syntax and test connectivity to both the source system and the DataHub instance before running a full ingestion. This catches credential issues and network problems early.
Key considerations:
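- Use the CLI's connection test option to validate connectivity (for example, `datahub ingest -c recipe.yml --test-source-connection`), or `--dry-run` to exercise the source without writing to the sink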
- Check that the DataHub GMS endpoint is accessible
- Verify source system credentials have read access to metadata
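Before contacting either system, the recipe structure itself can be sanity-checked locally. This is a minimal sketch of such a pre-check, not DataHub's own validation: the function name `check_recipe` and the rules it applies are assumptions based on the recipe layout described above (one source, an optional list of transformers, an optional sink).

```python
def check_recipe(recipe):
    """Return a list of structural problems; an empty list means none were found."""
    problems = []

    source = recipe.get("source")
    if not isinstance(source, dict) or not source.get("type"):
        problems.append("source.type is required")

    # The sink may be omitted, but when present it must name a type.
    sink = recipe.get("sink")
    if sink is not None and (not isinstance(sink, dict) or not sink.get("type")):
        problems.append("sink.type is required when a sink section is present")

    transformers = recipe.get("transformers", [])
    if not isinstance(transformers, list):
        problems.append("transformers must be a list")
    else:
        for index, transformer in enumerate(transformers):
            if not isinstance(transformer, dict) or not transformer.get("type"):
                problems.append(f"transformers[{index}].type is required")

    return problems
```

A check like this only catches shape errors. Credential and network problems still need a real connection test against the source and the GMS endpoint.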
Step 4: Execute the Ingestion
Run the ingestion pipeline from the CLI, passing the recipe file (for example, `datahub ingest -c recipe.yml`). The framework connects to the source, extracts metadata (datasets, schemas, lineage, ownership, tags), applies any transformers, and sends the results to the configured sink.
Key considerations:
- Ingestion can be run as a one-time batch or scheduled for recurring execution
- The CLI provides progress output and a summary report
- Failed records are logged but do not halt the entire pipeline
- The UI ingestion tab provides an alternative graphical interface
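The same run can be driven through the programmatic Python API. The sketch below assumes the `Pipeline` class from `datahub.ingestion.run.pipeline` with its `create`, `run`, `pretty_print_summary`, and `raise_from_status` methods; the wrapper `run_ingestion` and its `pipeline_cls` parameter are hypothetical, added so the flow can be exercised without a live DataHub instance.

```python
def run_ingestion(recipe, pipeline_cls=None):
    """Run one recipe dict and raise if the run reported failures."""
    if pipeline_cls is None:
        # Imported lazily so this module loads even where acryl-datahub is absent.
        from datahub.ingestion.run.pipeline import Pipeline as pipeline_cls

    pipeline = pipeline_cls.create(recipe)  # builds source, transformers, and sink
    pipeline.run()                          # extract, transform, and send to the sink
    pipeline.pretty_print_summary()         # progress and summary report
    pipeline.raise_from_status()            # turn reported failures into an exception
    return pipeline
```

Calling `raise_from_status()` matters for scheduled runs: because individual failed records do not stop the pipeline, a scheduler would otherwise see a successful exit even when parts of the ingestion failed.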
Step 5: Verify Ingested Metadata
After ingestion completes, verify that the expected entities appear in the DataHub UI. Check that datasets, schemas, lineage relationships, and governance metadata (tags, owners, glossary terms) are correctly represented.
Key considerations:
- Browse the DataHub UI to confirm entity counts
- Verify schema fields and lineage edges
- Check the ingestion run summary for any warnings or errors
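Spot checks in the UI can be complemented by a scripted comparison against a list of expected entities. This is a minimal sketch: `dataset_urn` formats a dataset URN in DataHub's standard layout, and `missing_entities` takes any callable that reports whether a URN exists. Both helper names and the example table names are hypothetical; in practice the callable might be the `exists` method of a `DataHubGraph` client, which is an assumption about your setup.

```python
def dataset_urn(platform, name, env="PROD"):
    """Format a dataset URN, e.g. for platform 'mysql' and name 'shop.orders'."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def missing_entities(urns, exists):
    """Return the URNs for which `exists(urn)` is false, preserving input order."""
    return [urn for urn in urns if not exists(urn)]


if __name__ == "__main__":
    expected = [
        dataset_urn("mysql", "shop.orders"),     # hypothetical table
        dataset_urn("mysql", "shop.customers"),  # hypothetical table
    ]
    # Stand-in lookup for demonstration; replace with a real existence check.
    already_ingested = {expected[0]}
    for urn in missing_entities(expected, already_ingested.__contains__):
        print("missing:", urn)
```

Keeping the existence check as a parameter lets the same comparison run against a live instance, a file export, or a test double.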