Workflow:Datahub project Datahub Python Metadata Emission

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Metadata_Management, SDK
Last Updated 2026-02-09 12:00 GMT

Overview

End-to-end process for programmatically emitting metadata to DataHub from Python applications using the REST or Kafka emitter.

Description

This workflow covers using the DataHub Python SDK as a library to push metadata events programmatically. Instead of declarative recipe files, it uses Python code to construct and emit MetadataChangeProposal events directly. Two emission transports are available: REST (synchronous, simpler) and Kafka (asynchronous, decoupled). The workflow suits CI/CD pipelines, custom orchestrators, and any Python application that needs to update DataHub metadata in real time.

Usage

Execute this workflow when you need to emit metadata from custom Python code, CI/CD pipelines, or orchestration scripts. Use REST emission when synchronous acknowledgment is important and throughput requirements are moderate. Use Kafka emission when you need to decouple from DataHub server availability or require higher throughput.

Execution Steps

Step 1: Install Python SDK

Install the acryl-datahub package with the appropriate extras for REST or Kafka emission. REST requires no additional infrastructure; Kafka requires a running Kafka cluster.

Key considerations:

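  • REST extras: pip install 'acryl-datahub[datahub-rest]' (quote the argument so shells such as zsh do not expand the brackets)
  • Kafka extras: pip install 'acryl-datahub[datahub-kafka]'
  • Both extras can be installed together: pip install 'acryl-datahub[datahub-rest,datahub-kafka]'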
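After installation, a quick import check can confirm which transports are usable in the current environment. The following is a minimal sketch: `EMITTER_MODULES`, `can_import`, and `available_transports` are hypothetical names introduced here, and the check only reports whether each emitter module imports cleanly, not whether a server is reachable.

```python
import importlib
from typing import Dict

# Module paths of the two emitters shipped in the acryl-datahub package.
EMITTER_MODULES = {
    "rest": "datahub.emitter.rest_emitter",
    "kafka": "datahub.emitter.kafka_emitter",
}


def can_import(module_name: str) -> bool:
    """Return True if the module (and everything it imports) loads cleanly."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


def available_transports() -> Dict[str, bool]:
    """Map each transport name to whether its emitter module is importable."""
    return {name: can_import(path) for name, path in EMITTER_MODULES.items()}
```

Calling `available_transports()` in a CI job before any emission step gives an early, readable failure if an extra was left out of the environment.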

Step 2: Create Emitter Instance

Instantiate either a REST or Kafka emitter configured with connection parameters. Verify connectivity before emitting events.

Key considerations:

  • REST emitter requires the GMS server URL and, if authentication is enabled, an access token
  • Kafka emitter requires bootstrap servers and a schema registry URL
  • Call test_connection() on the REST emitter to verify that it can reach DataHub
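The emitter setup above might look like the following sketch. It assumes the `DatahubRestEmitter` and `DatahubKafkaEmitter` classes from the SDK and the `parse_obj` config style used in DataHub's published examples; newer SDK releases may prefer a different config constructor. The helper names (`rest_emitter_kwargs`, `kafka_emitter_config`, `make_rest_emitter`, `make_kafka_emitter`) are hypothetical.

```python
from typing import Any, Dict, Optional


def rest_emitter_kwargs(gms_server: str, token: Optional[str] = None) -> Dict[str, Any]:
    """Keyword arguments for DatahubRestEmitter; the token is omitted when unset."""
    kwargs: Dict[str, Any] = {"gms_server": gms_server.rstrip("/")}
    if token:
        kwargs["token"] = token
    return kwargs


def kafka_emitter_config(bootstrap: str, schema_registry_url: str) -> Dict[str, Any]:
    """Plain-dict config in the shape passed to KafkaEmitterConfig.parse_obj()."""
    return {
        "connection": {
            "bootstrap": bootstrap,
            "schema_registry_url": schema_registry_url,
        }
    }


def make_rest_emitter(gms_server: str, token: Optional[str] = None):
    # Imported lazily so this module loads even without the datahub-rest extra.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(**rest_emitter_kwargs(gms_server, token))
    emitter.test_connection()  # raises if the GMS endpoint cannot be reached
    return emitter


def make_kafka_emitter(bootstrap: str, schema_registry_url: str):
    # Imported lazily so this module loads even without the datahub-kafka extra.
    from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig

    config = KafkaEmitterConfig.parse_obj(kafka_emitter_config(bootstrap, schema_registry_url))
    return DatahubKafkaEmitter(config)
```

For a local quickstart deployment, `make_rest_emitter("http://localhost:8080")` is a typical call; the host and port are assumptions about the target environment.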

Step 3: Construct Metadata Objects

Build the metadata aspect objects representing the information to emit. Use the mce_builder helpers to construct URNs and the appropriate aspect classes for entity properties.

Key considerations:

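  • Use make_dataset_urn (from datahub.emitter.mce_builder) to create dataset URN strings
  • Build aspect instances; in the Python SDK the generated classes carry a Class suffix (DatasetPropertiesClass, SchemaMetadataClass, etc.)
  • Populate the fields each aspect needs, such as description and customProperties on dataset properties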
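As a sketch of this step, the helpers below build a dataset URN and a dataset properties aspect. `dataset_urn_string` is a hypothetical, illustration-only function that shows the string layout of a dataset URN; real code should call `make_dataset_urn`, as `build_urn` does. The platform, table name, and property values used in the usage note are made up.

```python
from typing import Dict, Optional


def dataset_urn_string(platform: str, name: str, env: str = "PROD") -> str:
    """Illustration only: the string layout of a dataset URN."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def build_urn(platform: str, name: str, env: str = "PROD") -> str:
    # Preferred in real code: let the SDK helper assemble the URN.
    from datahub.emitter.mce_builder import make_dataset_urn

    return make_dataset_urn(platform=platform, name=name, env=env)


def build_dataset_properties(description: str, custom: Optional[Dict[str, str]] = None):
    # Lazy import so this module loads without acryl-datahub installed.
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    return DatasetPropertiesClass(
        description=description,
        customProperties=custom or {},
    )
```

A call such as `build_dataset_properties("Daily order facts", {"owner_team": "analytics"})` returns an aspect object ready to be wrapped in the next step.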

Step 4: Wrap in MetadataChangeProposal

Create a MetadataChangeProposalWrapper containing the entity URN, aspect name, and aspect value. This is the standardized envelope for all metadata changes.

Key considerations:

  • entityUrn identifies the target entity
  • aspect contains the metadata payload; the wrapper derives the aspect name from the aspect's class
  • changeType defaults to UPSERT, which creates the aspect or replaces the existing version
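The wrapping step can be sketched as follows, assuming `MetadataChangeProposalWrapper` from `datahub.emitter.mcp`. `require_urn` and `wrap_aspect` are hypothetical helper names; the URN check is a cheap guard added for illustration, not something the SDK requires.

```python
from typing import Any


def require_urn(value: str) -> str:
    """Reject strings that are not DataHub URNs before they reach the emitter."""
    if not value.startswith("urn:li:"):
        raise ValueError(f"not a DataHub URN: {value!r}")
    return value


def wrap_aspect(entity_urn: str, aspect: Any):
    # Lazy import so this module loads without acryl-datahub installed.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper

    # changeType is left at its default (UPSERT).
    return MetadataChangeProposalWrapper(
        entityUrn=require_urn(entity_urn),
        aspect=aspect,
    )
```

Emitting several aspects for the same entity means creating one wrapper per aspect, each with the same entityUrn.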

Step 5: Emit Events

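Send the MetadataChangeProposal to DataHub via the emitter. The REST emitter's emit() call blocks until the server responds and signals failure by raising an exception rather than through a return value; Kafka emitters buffer events and require explicit flushing.

Key considerations:

  • REST: emitter.emit(mcp) returns nothing on success and raises an exception on failure
  • Kafka: emitter.emit(mcp, callback) is asynchronous; the callback receives the delivery error (if any) and the message
  • Kafka requires emitter.flush() to ensure all buffered events are delivered before the process exits
  • Wrap emission in error handling, and retry transient failures with a bounded number of attempts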
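The error-handling advice above can be sketched with two small helpers. Both are hypothetical and duck-typed: `emit_with_retry` accepts any callable that raises on failure (such as a REST emitter's bound `emit` method), and `emit_batch_kafka` accepts any object with `emit(item, callback=...)` and `flush()` methods. The fixed-delay retry policy is an assumption chosen for brevity; production code may want backoff and a narrower exception filter.

```python
import time
from typing import Any, Callable, Iterable, List


def emit_with_retry(
    emit: Callable[[Any], None],
    item: Any,
    attempts: int = 3,
    delay_s: float = 1.0,
) -> None:
    """Call emit(item), retrying on any exception with a fixed delay."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            emit(item)
            return
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            time.sleep(delay_s)


def emit_batch_kafka(emitter: Any, items: Iterable[Any]) -> List[Any]:
    """Queue every item, flush, and return the delivery errors seen by the callback."""
    errors: List[Any] = []

    def on_delivery(err: Any, msg: Any) -> None:
        # The callback receives (error, message); error is None on success.
        if err is not None:
            errors.append(err)

    for item in items:
        emitter.emit(item, callback=on_delivery)
    emitter.flush()  # block until buffered events are delivered or have failed
    return errors
```

With a REST emitter this would be used as `emit_with_retry(rest_emitter.emit, mcp)`; with a Kafka emitter, an empty list from `emit_batch_kafka(kafka_emitter, mcps)` indicates that no delivery errors were reported.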
