Workflow:Datahub project Datahub Metadata Ingestion Pipeline

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, Metadata_Management, ETL
Last Updated: 2026-02-09 17:00 GMT

Overview

End-to-end process for extracting metadata from external data sources and ingesting it into a DataHub instance using the Python-based ingestion framework.

Description

This workflow covers the standard procedure for pulling metadata from data platforms (databases, BI tools, data warehouses, etc.) and loading it into DataHub. It uses a YAML-based recipe that defines a source connector, optional transformers, and a sink destination. The framework supports over 50 source connectors and can be run from the CLI, through the Python API, or from the ingestion tab in the DataHub UI.

Usage

Execute this workflow when you need to catalog metadata from an external data platform in DataHub. Typical triggers include onboarding a new data source, scheduling a recurring metadata refresh, or performing a one-time metadata import. You need access credentials for the source system and a running DataHub instance.

Execution Steps

Step 1: Install the CLI and Source Connector

Install the DataHub CLI Python package (published on PyPI as acryl-datahub) together with the connector plugin for your data source. Each source type has a dedicated plugin that pulls in the drivers and dependencies it needs.

Key considerations:

  • Requires Python 3.8+
  • Each source connector is installed as a pip extra of the CLI package (e.g., bigquery, mysql, looker)
  • Virtual environments are recommended to isolate dependencies
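
As a rough sketch of the install step, the helper below builds the pip command for the CLI plus one or more connector extras. It assumes the PyPI distribution name acryl-datahub; pip_install_args is a hypothetical helper name, and the connector names are just the examples from the list above.

```python
import shlex
import sys

# Assumption: "acryl-datahub" is the PyPI distribution that provides the CLI.
PACKAGE = "acryl-datahub"


def pip_install_args(connectors):
    """Return the argv list for installing the CLI plus the given connector extras."""
    extras = ",".join(sorted(set(connectors)))
    requirement = f"{PACKAGE}[{extras}]" if extras else PACKAGE
    # "python -m pip" ties the install to the currently active (virtual) environment.
    return [sys.executable, "-m", "pip", "install", requirement]


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(pip_install_args(["mysql", "looker"])))
```

Passing the returned list to subprocess.run would perform the install inside whichever environment is running the script.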

Step 2: Configure the Ingestion Recipe

Create a YAML recipe file with up to three sections: source, transformers (optional), and sink. The source section specifies the connector type and its connection settings, including credentials. If the sink section is omitted, the DataHub REST API sink is used by default.

Key considerations:

  • Use environment variable substitution for secrets and credentials
  • Each recipe pairs exactly one source with one sink
  • The transformers section can add tags, ownership, or modify metadata in transit
  • The special key prefix __DATAHUB_TO_FILE_ writes an inline value to a file and passes the file path to the connector, which helps with file-based settings such as SSL certificates
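
The recipe layout and the environment variable substitution described above can be sketched in Python. This is only an illustration of the idea: the recipe is written as the dict equivalent of the YAML file, the variable names (MYSQL_HOST, MYSQL_USER, MYSQL_PASSWORD, DATAHUB_GMS_URL) and the tag URN are hypothetical, and the expand helper is a stand-in, not DataHub's own resolver.

```python
import re

# Hypothetical recipe: the Python-dict equivalent of a YAML recipe file.
# All ${...} names and the tag URN are placeholders chosen for this example.
RECIPE = {
    "source": {
        "type": "mysql",
        "config": {
            "host_port": "${MYSQL_HOST}:3306",
            "username": "${MYSQL_USER}",
            "password": "${MYSQL_PASSWORD}",
        },
    },
    "transformers": [
        {
            "type": "simple_add_dataset_tags",
            "config": {"tag_urns": ["urn:li:tag:Imported"]},
        }
    ],
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "${DATAHUB_GMS_URL}"},
    },
}

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand(value, env):
    """Recursively replace ${NAME} placeholders using the env mapping."""
    if isinstance(value, dict):
        return {key: expand(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand(item, env) for item in value]
    if isinstance(value, str):
        # env[...] raises KeyError for an unset variable, so a missing
        # secret fails loudly instead of being sent as a literal "${...}".
        return _VAR.sub(lambda match: env[match.group(1)], value)
    return value
```

Calling expand(RECIPE, os.environ) would resolve the placeholders from the process environment, so the secrets themselves never need to be stored in the recipe file.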

Step 3: Validate the Recipe and Test Connection

Verify the recipe YAML syntax and test connectivity to both the source system and the DataHub instance before running a full ingestion. This catches credential issues and network problems early.

Key considerations:

  • Use the CLI's test options before a full run, for example the --test-source-connection or --dry-run flags of datahub ingest
  • Check that the DataHub GMS endpoint is accessible
  • Verify source system credentials have read access to metadata
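
Before any connectivity test, the shape of the recipe itself can be checked cheaply. The function below is a minimal sketch of such a pre-flight check on a recipe that has already been loaded into a dict. validate_recipe is a hypothetical helper, not part of the DataHub CLI, and it only checks structure, not credentials or network access.

```python
def validate_recipe(recipe):
    """Return a list of problems found; an empty list means the shape looks fine."""
    problems = []

    source = recipe.get("source")
    if not isinstance(source, dict) or not source.get("type"):
        problems.append("source.type is required")

    # The sink may be omitted; the workflow text says it then defaults to DataHub REST.
    sink = recipe.get("sink")
    if sink is not None and (not isinstance(sink, dict) or not sink.get("type")):
        problems.append("sink.type is required when a sink section is present")

    transformers = recipe.get("transformers", [])
    if not isinstance(transformers, list):
        problems.append("transformers must be a list")
    else:
        for index, transformer in enumerate(transformers):
            if not isinstance(transformer, dict) or not transformer.get("type"):
                problems.append(f"transformers[{index}].type is required")

    return problems
```

Running this first keeps simple mistakes, such as a missing type key, from being confused with credential or network failures later on.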

Step 4: Execute the Ingestion

Run the ingestion pipeline using the CLI command with the recipe file. The framework connects to the source, extracts metadata (datasets, schemas, lineage, ownership, tags), applies any transformers, and sends the results to the configured sink.

Key considerations:

  • Ingestion can be run as a one-time batch or scheduled for recurring execution
  • The CLI provides progress output and a summary report
  • Failed records are logged but do not halt the entire pipeline
  • The UI ingestion tab provides an alternative graphical interface
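
The two scriptable ways of running a recipe can be sketched as follows. The helper names are hypothetical. The CLI variant assumes the datahub executable is on PATH, and the in-process variant assumes the acryl-datahub package is installed and that its Pipeline class behaves as commented; treat it as a starting point to check against the installed version.

```python
import subprocess


def ingest_command(recipe_path, dry_run=False):
    """Build the argv for running one recipe with the DataHub CLI."""
    command = ["datahub", "ingest", "-c", str(recipe_path)]
    if dry_run:
        # Assumption: --dry-run extracts metadata without writing it to the sink.
        command.append("--dry-run")
    return command


def run_with_cli(recipe_path, dry_run=False):
    """Run the CLI and return its exit code; progress output goes to the terminal."""
    return subprocess.run(ingest_command(recipe_path, dry_run)).returncode


def run_with_python_api(recipe):
    """Run the same recipe, given as a dict, inside the current process."""
    # Imported lazily so this module still loads where acryl-datahub is absent.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.pretty_print_summary()
    # Raises if the run reported failures, so a scheduler sees the job as failed.
    pipeline.raise_from_status()
```

For recurring execution, either function can be called from a scheduler such as cron or an orchestration tool, with the exit code or raised exception marking a failed run.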

Step 5: Verify Ingested Metadata

After ingestion completes, verify that the expected entities appear in the DataHub UI. Check that datasets, schemas, lineage relationships, and governance metadata (tags, owners, glossary terms) are correctly represented.

Key considerations:

  • Browse the DataHub UI to confirm entity counts
  • Verify schema fields and lineage edges
  • Check the ingestion run summary for any warnings or errors
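
Part of the verification can be scripted by comparing the datasets you expected against the ones that actually arrived. The sketch below assumes dataset URNs of the form urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>) and that the set of found URNs has been obtained separately, for example from a search export. Both helper names and the sample table names are hypothetical.

```python
def dataset_urn(platform, name, env="PROD"):
    """Build a dataset URN from platform, qualified dataset name, and environment."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def missing_datasets(expected_names, found_urns, platform, env="PROD"):
    """Return the expected dataset URNs that are absent from found_urns, sorted."""
    expected = {dataset_urn(platform, name, env) for name in expected_names}
    return sorted(expected - set(found_urns))


# Example with made-up table names: one of two expected tables was ingested.
found = [dataset_urn("mysql", "shop.orders")]
for urn in missing_datasets(["shop.orders", "shop.customers"], found, "mysql"):
    print("missing:", urn)
```

An empty result only confirms that the entities exist; schema fields and lineage edges still need the checks listed above.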
