Workflow:Datahub project Datahub Python Metadata Emission
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Metadata_Management, SDK |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
End-to-end process for programmatically emitting metadata to DataHub from Python applications using the REST or Kafka emitter.
Description
This workflow covers using the DataHub Python SDK as a library to push metadata events programmatically. Rather than relying on declarative recipe files, this approach uses Python code to construct and emit MetadataChangeProposal events directly. Two emission transports are available: REST (synchronous, simpler) and Kafka (asynchronous, decoupled). The workflow is suitable for CI/CD pipelines, custom orchestrators, and any Python application that needs to update DataHub metadata in real time.
Usage
Execute this workflow when you need to emit metadata from custom Python code, CI/CD pipelines, or orchestration scripts. Use REST emission when synchronous acknowledgment is important and throughput requirements are moderate. Use Kafka emission when you need to decouple from DataHub server availability or require higher throughput.
Execution Steps
Step 1: Install Python SDK
Install the acryl-datahub package with the appropriate extras for REST or Kafka emission. REST requires no additional infrastructure; Kafka requires a running Kafka cluster.
Key considerations:
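- REST extras: pip install 'acryl-datahub[datahub-rest]' (quote the argument so that shells such as zsh do not interpret the square brackets)
- Kafka extras: pip install 'acryl-datahub[datahub-kafka]'
- Both extras can be installed together: pip install 'acryl-datahub[datahub-rest,datahub-kafka]'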
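After installation, a quick check along the following lines can confirm which transports are importable. This is a minimal sketch using only the standard library. The helper name `module_available` is hypothetical, and the mapping of the REST transport to `requests` and the Kafka transport to `confluent_kafka` is an assumption about what each extra installs.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # find_spec raises this when a parent package of `name` is missing.
        return False


# Assumption: the REST transport relies on `requests` and the Kafka
# transport on `confluent_kafka`; adjust if your SDK version differs.
checks = {
    "acryl-datahub": "datahub",
    "REST transport": "requests",
    "Kafka transport": "confluent_kafka",
}

for label, module in checks.items():
    state = "available" if module_available(module) else "missing"
    print(f"{label}: {state}")
```

Running this in the target environment (for example, as the first step of a CI job) gives an early signal before any emitter is constructed.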
Step 2: Create Emitter Instance
Instantiate either a REST or Kafka emitter configured with connection parameters. Verify connectivity before emitting events.
Key considerations:
- The REST emitter requires the GMS server URL and, optionally, an access token
- The Kafka emitter requires the bootstrap servers and the schema registry URL
- For the REST emitter, call test_connection() to verify that it can reach DataHub before emitting
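The considerations above can be sketched as follows. The environment variable names, default URLs, and helper function names are assumptions chosen for illustration. `DatahubRestEmitter`, `DatahubKafkaEmitter`, and `KafkaEmitterConfig` are the SDK classes as understood at the time of writing; check them against the installed version.

```python
import os
from typing import Any, Dict, Mapping


def rest_emitter_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Collect REST emitter settings (environment variable names are assumed)."""
    kwargs: Dict[str, Any] = {
        "gms_server": env.get("DATAHUB_GMS_URL", "http://localhost:8080"),
    }
    token = env.get("DATAHUB_GMS_TOKEN")
    if token:  # the token is optional, so pass it only when it is set
        kwargs["token"] = token
    return kwargs


def kafka_emitter_config(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build the Kafka connection settings (environment variable names are assumed)."""
    return {
        "connection": {
            "bootstrap": env.get("KAFKA_BOOTSTRAP", "localhost:9092"),
            "schema_registry_url": env.get(
                "SCHEMA_REGISTRY_URL", "http://localhost:8081"
            ),
        }
    }


def build_rest_emitter(env: Mapping[str, str] = os.environ):
    # Imported lazily so this module loads even when the REST extra is absent.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(**rest_emitter_kwargs(env))
    emitter.test_connection()  # expected to raise if GMS cannot be reached
    return emitter


def build_kafka_emitter(env: Mapping[str, str] = os.environ):
    # Imported lazily so this module loads even when the Kafka extra is absent.
    from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig

    config = KafkaEmitterConfig.parse_obj(kafka_emitter_config(env))
    return DatahubKafkaEmitter(config)
```

Keeping the settings in small pure functions makes them easy to inspect or log (with the token redacted) before any network connection is attempted.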
Step 3: Construct Metadata Objects
Build the metadata aspect objects representing the information to emit. Use the mce_builder helpers to construct URNs and the appropriate aspect classes for entity properties.
Key considerations:
- Use make_dataset_urn for creating dataset URN strings
- Build aspect instances (DatasetProperties, SchemaMetadata, etc.)
- Set appropriate fields on each aspect object
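As an illustration of this step, the sketch below builds a dataset URN and a DatasetProperties aspect. The function name `build_dataset_properties` and its parameters are hypothetical; `make_dataset_urn` and `DatasetPropertiesClass` are the SDK names as understood here and should be checked against the installed version.

```python
from typing import Any, Dict, Optional, Tuple


def build_dataset_properties(
    platform: str,
    name: str,
    description: str,
    env: str = "PROD",
    custom_properties: Optional[Dict[str, str]] = None,
) -> Tuple[str, Any]:
    """Return (dataset URN, DatasetProperties aspect) for one dataset."""
    # Imported lazily so this module loads even without the SDK installed.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    # The URN is a string of the form
    # urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
    urn = make_dataset_urn(platform=platform, name=name, env=env)

    aspect = DatasetPropertiesClass(
        description=description,
        customProperties=custom_properties or {},
    )
    return urn, aspect
```

A call such as `build_dataset_properties("hive", "db.orders", "Orders table")` would return the URN string together with an aspect object ready to be wrapped in the next step.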
Step 4: Wrap in MetadataChangeProposal
Create a MetadataChangeProposalWrapper containing the entity URN, aspect name, and aspect value. This is the standardized envelope for all metadata changes.
Key considerations:
- entityUrn identifies the target entity
- aspect contains the metadata payload
- changeType is typically UPSERT (understood to be the wrapper's default), which creates the aspect or replaces its existing value
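A minimal sketch of the wrapping step follows. The helper name `wrap_aspects` is hypothetical. The sketch assumes that `MetadataChangeProposalWrapper` accepts `entityUrn` and `aspect` as keyword arguments and derives the aspect name from the aspect object.

```python
from typing import Any, Iterable, List


def wrap_aspects(entity_urn: str, aspects: Iterable[Any]) -> List[Any]:
    """Wrap each aspect for one entity in a MetadataChangeProposalWrapper."""
    # Imported lazily so this module loads even without the SDK installed.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper

    return [
        # changeType is left at its default, which is understood to be UPSERT.
        MetadataChangeProposalWrapper(entityUrn=entity_urn, aspect=aspect)
        for aspect in aspects
    ]
```

Producing one proposal per aspect keeps each change independent, so a failure on one aspect does not block the others for the same entity.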
Step 5: Emit Events
Send the MetadataChangeProposal to DataHub via the emitter. The REST emitter sends each event in a blocking call and signals failure by raising an exception; the Kafka emitter buffers events and requires explicit flushing.
Key considerations:
- REST: emitter.emit(mcp) blocks until the server responds and raises an exception if the request fails
- Kafka: emitter.emit(mcp, callback) is asynchronous; the callback reports delivery success or failure
- Kafka requires emitter.flush() to ensure all events are delivered
- Handle errors and retries appropriately
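One way to handle errors and retries is sketched below. The function `emit_all` and its `retries` and `backoff_sec` parameters are hypothetical. The sketch assumes only that the emitter exposes `emit(mcp)`, that a failed `emit` raises an exception (as the REST emitter is understood to do), and that buffering emitters expose `flush()`.

```python
import time
from typing import Any, Iterable, List, Tuple


def emit_all(
    emitter: Any,
    mcps: Iterable[Any],
    retries: int = 2,
    backoff_sec: float = 0.5,
) -> List[Tuple[Any, Exception]]:
    """Emit each proposal, retrying on exceptions; return those that still fail."""
    failed: List[Tuple[Any, Exception]] = []
    for mcp in mcps:
        for attempt in range(retries + 1):
            try:
                emitter.emit(mcp)
                break  # emitted successfully, move on to the next proposal
            except Exception as exc:
                if attempt == retries:
                    failed.append((mcp, exc))
                else:
                    # Exponential backoff between attempts.
                    time.sleep(backoff_sec * (2 ** attempt))

    # Buffering emitters (such as Kafka) need a flush to deliver pending events.
    flush = getattr(emitter, "flush", None)
    if callable(flush):
        flush()
    return failed
```

With a Kafka emitter, delivery failures are reported through the callback rather than raised, so this retry loop would not see them; in that case pass a callback to `emit` and collect errors there.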