Workflow:Datahub project Datahub CLI Metadata Ingestion

Knowledge Sources	DataHub, DataHub Docs, CLI Ingestion Guide
Domains	Data_Engineering, Metadata_Management, ETL
Last Updated	2026-02-09 12:00 GMT

Overview

End-to-end process for batch metadata ingestion from external data sources into DataHub using the CLI and YAML recipe files.

Description

This workflow covers the standard procedure for extracting metadata from data sources (databases, data warehouses, BI tools, orchestrators) and loading it into DataHub. It uses a declarative YAML recipe that defines a source connector, optional transformers, and a sink. The CLI orchestrates the pipeline: it extracts metadata from the source, applies the transformers, and emits the result to the sink. Recipes support environment variable substitution, which keeps credentials out of the recipe file, and can be scheduled via Airflow or cron for recurring ingestion.

Usage

Execute this workflow when you need to populate DataHub with metadata from an external system such as a database (MySQL, Postgres), data warehouse (Snowflake, BigQuery, Redshift), BI tool (Looker, Tableau), or orchestrator (Airflow). This is the primary batch ingestion method and is suitable for scheduled, periodic metadata extraction.

Execution Steps

Step 1: Install DataHub CLI

Install the DataHub CLI Python package (acryl-datahub on PyPI), which provides the ingestion framework and connector plugins. The base package includes core functionality; source-specific plugins are installed as extras (e.g., mysql, bigquery, snowflake).

Key considerations:

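Requires Python 3.8 or later; newer CLI releases may raise the minimum, so check the installation docs for the version you install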
Install source-specific extras for the target connector
Verify the installation by running datahub version
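
The installation steps above might look like the following sketch. The package name acryl-datahub and the datahub version command are the ones used by the DataHub CLI; the install_datahub_cli wrapper function and the choice of extras passed to it are illustrative assumptions, not part of DataHub.

```shell
# Sketch: install the DataHub CLI plus one or more source extras.
# install_datahub_cli is a hypothetical helper name, not part of DataHub.
install_datahub_cli() {
  # Join the requested extras with commas, e.g. "mysql,snowflake".
  extras=$(IFS=,; echo "$*")

  # Upgrade the packaging tools first to avoid build problems with an old pip.
  python3 -m pip install --upgrade pip wheel setuptools || return 1

  if [ -n "$extras" ]; then
    # Base package plus the connector plugins for the target sources.
    python3 -m pip install --upgrade "acryl-datahub[$extras]" || return 1
  else
    # Base package only (core functionality, no source plugins).
    python3 -m pip install --upgrade acryl-datahub || return 1
  fi

  # Confirm the CLI is on PATH and print its version.
  datahub version
}
```

For example, install_datahub_cli mysql would install the base package with the MySQL connector. Running this inside a virtual environment keeps the CLI's dependencies separate from other Python tools.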

Step 2: Create Recipe Configuration

Author a YAML recipe file that defines the complete ingestion pipeline. The recipe specifies three main sections: source (connector type and credentials), sink (DataHub REST endpoint), and optional transformers.

Key considerations:

Each recipe handles exactly one source and one sink
Use environment variable substitution for secrets and credentials
Use the .dhub.yaml extension for IDE autocomplete support
Transformers can add tags, owners, glossary terms, or modify metadata in transit
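
As an illustration of the recipe structure described above, the sketch below writes a MySQL-to-DataHub recipe from the shell. The file name, the environment variable names, and the NeedsDocumentation tag are assumptions chosen for the example; the source type mysql, the transformer type simple_add_dataset_tags, and the sink type datahub-rest are existing DataHub plugin names, but the available config fields should be checked against the connector documentation for your version.

```shell
# Write a recipe for a MySQL source (file name is a placeholder).
# The quoted 'EOF' delimiter stops the shell from expanding ${...}, so the
# placeholders reach the file unchanged and are resolved by the DataHub CLI
# from the environment when the recipe is loaded.
cat > mysql_to_datahub.dhub.yaml <<'EOF'
source:
  type: mysql
  config:
    host_port: "${MYSQL_HOST_PORT}"
    database: "${MYSQL_DATABASE}"
    username: "${MYSQL_USERNAME}"
    password: "${MYSQL_PASSWORD}"

# Optional: tag every ingested dataset (tag name is hypothetical).
transformers:
  - type: simple_add_dataset_tags
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"

sink:
  type: datahub-rest
  config:
    server: "${DATAHUB_GMS_URL}"
    token: "${DATAHUB_GMS_TOKEN}"
EOF
```

Because the recipe contains only placeholders, it can be committed to version control while the actual values stay in the environment.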

Step 3: Configure Authentication

Set up authentication credentials for both the source system and the DataHub server. Source credentials are defined in the recipe; DataHub authentication uses a personal access token.

Key considerations:

Generate a DataHub API token from the Settings page in the UI
Store credentials in environment variables, not directly in recipe files
Some sources support service account or IAM-based authentication
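
One way to apply these considerations is to export the values a recipe refers to and check them before running ingestion. This is a minimal sketch: the variable names, the placeholder values, and the check_recipe_env helper are assumptions for illustration; only variables that your own recipe references matter.

```shell
# Non-secret settings; the values are placeholders for a local setup.
export DATAHUB_GMS_URL="http://localhost:8080"
export MYSQL_HOST_PORT="localhost:3306"

# Secrets such as MYSQL_PASSWORD and DATAHUB_GMS_TOKEN should be injected by a
# secret manager or CI system rather than written into a script or recipe.

# check_recipe_env is a hypothetical helper: it fails if any named variable is
# unset, empty, or not exported (a child process such as the CLI only sees
# exported variables).
check_recipe_env() {
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_recipe_env MYSQL_PASSWORD DATAHUB_GMS_TOKEN before the ingestion run surfaces a missing credential as a clear message instead of a connection error partway through the pipeline.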

Step 4: Execute Ingestion

Run the ingestion pipeline via the CLI command. The CLI reads the recipe, connects to the source, extracts metadata, applies transformers, and pushes results to the DataHub sink.

Key considerations:

Monitor CLI output for extraction progress and errors
Failed records are logged but do not stop the pipeline, so review the end-of-run report for failures and warnings
Use dry-run mode (the --dry-run flag) to validate configuration without emitting metadata
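
The execution step can be sketched as a dry run followed by the real run. The datahub ingest command with -c and --dry-run is the actual CLI invocation; the run_ingestion wrapper and the decision to skip the real run when the dry run fails are assumptions made for this example.

```shell
# run_ingestion is a hypothetical wrapper around the `datahub ingest` command.
run_ingestion() {
  recipe=$1

  # Dry run: loads the recipe and runs the extraction, but skips writing
  # to the sink, so configuration errors show up before anything is emitted.
  datahub ingest -c "$recipe" --dry-run || {
    echo "dry run failed for $recipe; skipping the real run" >&2
    return 1
  }

  # Real run: emits metadata to the sink defined in the recipe.
  # The function's exit status is the exit status of this command.
  datahub ingest -c "$recipe"
}
```

A call such as run_ingestion ./my_recipe.dhub.yaml (path is a placeholder) extracts from the source twice, once per run, which may be undesirable for large or slow sources; in that case the dry run can be limited to when the recipe changes.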

Step 5: Schedule Recurring Ingestion

Configure automated scheduling for periodic metadata refresh. Daily ingestion is recommended for most sources to keep metadata current.

Key considerations:

Apache Airflow is the recommended scheduler (BashOperator or PythonOperator)
cron can be used as a simpler alternative
DataHub UI ingestion provides a built-in scheduling interface
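
For the cron option, a daily schedule could be generated as in the sketch below. The daily_cron_entry helper, the 05:00 run time, and the file paths are assumptions for illustration; only the datahub ingest -c invocation comes from the CLI itself.

```shell
# daily_cron_entry is a hypothetical helper that prints a crontab line which
# runs the ingestion every day at 05:00.
# Crontab fields: minute hour day-of-month month day-of-week command
daily_cron_entry() {
  recipe=$1
  log_file=$2
  printf '0 5 * * * datahub ingest -c %s >> %s 2>&1\n' "$recipe" "$log_file"
}

# Print the entry for review; both paths are placeholders.
daily_cron_entry /opt/datahub/recipes/mysql_to_datahub.dhub.yaml /var/log/datahub_ingest.log
```

The printed line can be added with crontab -e. cron runs jobs with a minimal environment, so the variables referenced by the recipe and the PATH entry for the datahub executable need to be set in the crontab or in a wrapper script.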
