Workflow:Datahub project Datahub CLI Metadata Ingestion

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, Metadata_Management, ETL
Last Updated: 2026-02-09 12:00 GMT

Overview

End-to-end process for batch metadata ingestion from external data sources into DataHub using the CLI and YAML recipe files.

Description

This workflow covers the standard procedure for extracting metadata from data sources (databases, data warehouses, BI tools, orchestrators) and loading it into DataHub. It uses a declarative YAML recipe that defines a source connector, optional transformers, and a sink. The CLI orchestrates the extraction, transformation, and emission pipeline. Recipes support environment variable substitution, which keeps credentials out of the recipe file, and can be scheduled via Airflow or cron for recurring ingestion.

Usage

Execute this workflow when you need to populate DataHub with metadata from an external system such as a database (MySQL, Postgres), data warehouse (Snowflake, BigQuery, Redshift), BI tool (Looker, Tableau), or orchestrator (Airflow). This is the primary batch ingestion method and suits scheduled, periodic metadata extraction.

Execution Steps

Step 1: Install DataHub CLI

Install the DataHub CLI Python package (published on PyPI as acryl-datahub), which provides the ingestion framework and connector plugins. The base package includes core functionality; source-specific plugins are installed as extras (e.g., mysql, bigquery, snowflake).

Key considerations:

  • Requires Python 3.8+
  • Install source-specific extras for the target connector
  • Verify the installation with the version command (datahub version)
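
As a minimal sketch of the installation step, the helper below builds the pip command for a chosen set of connector extras. The function name pip_install_command is hypothetical; the package name acryl-datahub and the datahub version check are the only DataHub-specific details it relies on. It only constructs the commands and does not run them.

```python
import sys


def pip_install_command(extras):
    """Build a pip command that installs the DataHub CLI with connector extras.

    `extras` is a list of plugin names such as ["mysql"] or ["bigquery"].
    """
    # "acryl-datahub" is the PyPI distribution that ships the `datahub` CLI.
    requirement = "acryl-datahub"
    if extras:
        requirement += "[" + ",".join(extras) + "]"
    # Using sys.executable keeps the install in the current interpreter's environment.
    return [sys.executable, "-m", "pip", "install", "--upgrade", requirement]


# Command used afterwards to confirm that the CLI is on the PATH.
VERSION_COMMAND = ["datahub", "version"]

if __name__ == "__main__":
    print(" ".join(pip_install_command(["mysql"])))
    print(" ".join(VERSION_COMMAND))
```

Passing either list to subprocess.run would perform the actual install or version check; installing into a virtual environment is a reasonable default so that connector dependencies stay isolated.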

Step 2: Create Recipe Configuration

Author a YAML recipe file that defines the complete ingestion pipeline. The recipe specifies three main sections: source (connector type and credentials), sink (DataHub REST endpoint), and optional transformers.

Key considerations:

  • Each recipe handles exactly one source and one sink
  • Use environment variable substitution for secrets and credentials
  • Use the .dhub.yaml extension for IDE autocomplete support
  • Transformers can add tags, owners, glossary terms, or modify metadata in transit
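
The three recipe sections can be sketched as follows. This is an illustrative recipe for a MySQL source, not a reference configuration: the host, database name, tag, and environment variable names (MYSQL_USERNAME, MYSQL_PASSWORD, DATAHUB_GMS_TOKEN) are assumptions chosen for the example, and the Python wrapper exists only to save the YAML text under a .dhub.yaml file name.

```python
import tempfile
from pathlib import Path

# Example recipe text. Values marked "hypothetical" are placeholders.
RECIPE = """\
source:
  type: mysql
  config:
    host_port: "localhost:3306"      # hypothetical host
    database: sales                  # hypothetical database name
    username: "${MYSQL_USERNAME}"    # resolved from the environment at run time
    password: "${MYSQL_PASSWORD}"

transformers:                        # optional section
  - type: simple_add_dataset_tags
    config:
      tag_urns:
        - "urn:li:tag:NeedsReview"   # hypothetical tag

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"  # hypothetical DataHub endpoint
    token: "${DATAHUB_GMS_TOKEN}"
"""


def write_recipe(path):
    """Write the example recipe to `path` and return it as a Path."""
    target = Path(path)
    target.write_text(RECIPE, encoding="utf-8")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        recipe = write_recipe(Path(workdir) / "mysql_to_datahub.dhub.yaml")
        print(recipe.name)
```

Because secrets appear only as ${...} references, the file itself can be committed to version control without exposing credentials.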

Step 3: Configure Authentication

Set up authentication credentials for both the source system and the DataHub server. Source credentials are defined in the recipe; DataHub authentication uses a personal access token.

Key considerations:

  • Generate a DataHub API token from the Settings page in the UI
  • Store credentials in environment variables, not directly in recipe files
  • Some sources support service account or IAM-based authentication
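
Since credentials live in environment variables, it can help to check that every variable a recipe references is actually set before running it. The sketch below is one way to do that; missing_env_vars is a hypothetical helper, not part of the DataHub CLI, and it recognises only the plain ${NAME} form.

```python
import os
import re

# Matches plain ${NAME} references in recipe text.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_env_vars(recipe_text, env=None):
    """Return the sorted names referenced as ${NAME} that are unset or empty."""
    env = os.environ if env is None else env
    referenced = sorted(set(_ENV_REF.findall(recipe_text)))
    return [name for name in referenced if not env.get(name)]


if __name__ == "__main__":
    sample = 'password: "${MYSQL_PASSWORD}"\ntoken: "${DATAHUB_GMS_TOKEN}"\n'
    missing = missing_env_vars(sample)
    if missing:
        print("Set these variables before ingesting:", ", ".join(missing))
    else:
        print("All referenced variables are set.")
```

A check like this fails fast with a readable message instead of surfacing as an authentication error partway through extraction.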

Step 4: Execute Ingestion

Run the ingestion pipeline with the datahub ingest command, passing the recipe file. The CLI reads the recipe, connects to the source, extracts metadata, applies transformers, and pushes results to the DataHub sink.

Key considerations:

  • Monitor CLI output for extraction progress and errors
  • Failed records are logged but do not stop the pipeline
  • Use dry-run mode to validate configuration without emitting metadata
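
A minimal sketch of the execution step, assuming the datahub CLI is installed and on the PATH. The helper names ingest_command and run_ingestion are hypothetical; the command they build uses the -c option for the recipe path and the --dry-run flag.

```python
import subprocess


def ingest_command(recipe_path, dry_run=False):
    """Build the CLI invocation for a recipe file."""
    command = ["datahub", "ingest", "-c", str(recipe_path)]
    if dry_run:
        # Validates the pipeline without emitting metadata to the sink.
        command.append("--dry-run")
    return command


def run_ingestion(recipe_path, dry_run=False):
    """Run the ingestion and return the CLI's exit code."""
    # check=False lets the caller inspect the exit code instead of catching an exception.
    completed = subprocess.run(ingest_command(recipe_path, dry_run), check=False)
    return completed.returncode


if __name__ == "__main__":
    print(" ".join(ingest_command("mysql_to_datahub.dhub.yaml", dry_run=True)))
```

Running once with dry_run=True before the first real run is a cheap way to catch recipe mistakes; since failed records do not stop the pipeline, the CLI output still needs to be reviewed after a run.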

Step 5: Schedule Recurring Ingestion

Configure automated scheduling for periodic metadata refresh. Daily ingestion is recommended for most sources to keep metadata current.

Key considerations:

  • Apache Airflow is the recommended scheduler (BashOperator or PythonOperator)
  • cron can be used as a simpler alternative
  • DataHub UI ingestion provides a built-in scheduling interface
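
For the cron alternative, a daily schedule might look like the entry generated below. This is a sketch under assumptions: crontab_entry is a hypothetical helper, and the default run time and log path are arbitrary placeholders.

```python
import shlex


def crontab_entry(recipe_path, hour=5, minute=0,
                  log_path="/var/log/datahub_ingest.log"):
    """Return a crontab line that runs the recipe once a day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # shlex.quote protects paths that contain spaces or shell metacharacters.
    command = "datahub ingest -c " + shlex.quote(recipe_path)
    # Append both stdout and stderr to the log so failed runs can be reviewed.
    return f"{minute} {hour} * * * {command} >> {shlex.quote(log_path)} 2>&1"


if __name__ == "__main__":
    print(crontab_entry("/opt/recipes/mysql_to_datahub.dhub.yaml"))
```

cron runs jobs with a minimal environment, so the variables the recipe references must be made available to the job, for example by defining them in the crontab or by wrapping the command in a script that loads them.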
