Principle: Datahub project Datahub Emitter Initialization
Metadata
| Field | Value |
|---|---|
| principle_name | Emitter Initialization |
| description | The process of creating an authenticated transport client for emitting metadata to DataHub via REST or Kafka. |
| type | principle |
| status | active |
| last_updated | 2026-02-10 |
| version | 1.0 |
Overview
Emitter Initialization is the process of creating an authenticated transport client for emitting metadata to DataHub via REST or Kafka. The emitter is the primary interface through which Python applications send metadata change proposals to the DataHub backend.
Description
Emitter initialization establishes a connection to the DataHub backend. The Emitter protocol (defined in datahub.emitter.generic_emitter) defines the contract that all emitter implementations must satisfy:
```python
from typing import Callable, Optional, Union

from typing_extensions import Protocol

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.com.linkedin.pegasus2avro.mxe import (
    MetadataChangeEvent,
    MetadataChangeProposal,
)


class Emitter(Protocol):
    def emit(
        self,
        item: Union[MetadataChangeEvent, MetadataChangeProposal, MetadataChangeProposalWrapper],
        callback: Optional[Callable[[Exception, str], None]] = None,
    ) -> None: ...

    def flush(self) -> None: ...
```
Two concrete implementations exist:
- DataHubRestEmitter -- Communicates with DataHub GMS over HTTP. Handles session management, retry logic with exponential backoff, SSL configuration, authentication via bearer tokens, and payload size management. Supports both RestLI and OpenAPI endpoints.
- DatahubKafkaEmitter -- Publishes metadata to Kafka topics using Avro serialization. Manages schema registry integration, topic routing for MCE and MCP message types, and optional OAuth authentication for managed Kafka services like AWS MSK.
Both implementations also implement the Closeable interface alongside the Emitter protocol, ensuring that underlying resources (HTTP sessions, Kafka producers) are released via close().
Usage
Use Emitter Initialization when starting a Python application that will emit metadata events to DataHub. The choice of emitter depends on infrastructure and requirements:
- DataHubRestEmitter is appropriate when:
- Direct, synchronous communication with GMS is needed
- Request-response confirmation of each emission is required
- The application needs to inspect server configuration (e.g., feature support)
- OpenAPI-based batching and trace-based write verification are desired
- DatahubKafkaEmitter is appropriate when:
- High-throughput asynchronous emission is needed
- Kafka infrastructure is already deployed
- Decoupled, fire-and-forget semantics are acceptable
- Schema registry-based serialization is preferred
Theoretical Basis
This principle follows the Strategy pattern. The Emitter Protocol defines the interface, with REST and Kafka as interchangeable strategies. Selection depends on infrastructure and latency requirements.
The Protocol-based approach (using typing_extensions.Protocol) enables structural subtyping, meaning any class that implements emit() and flush() with the correct signatures is considered an Emitter without explicit inheritance. This provides flexibility for custom emitter implementations.
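The structural-subtyping point can be demonstrated with a small self-contained sketch. InMemoryEmitter here is a hypothetical stand-in invented for this illustration (the SDK uses typing_extensions.Protocol for compatibility; the stdlib typing.Protocol used below behaves the same way):

```python
# Sketch: a class with matching emit()/flush() signatures satisfies the
# Emitter contract without inheriting from it (structural subtyping).
from typing import Any, Callable, List, Optional, Protocol, runtime_checkable


@runtime_checkable
class Emitter(Protocol):
    def emit(
        self,
        item: Any,
        callback: Optional[Callable[[Exception, str], None]] = None,
    ) -> None: ...

    def flush(self) -> None: ...


class InMemoryEmitter:  # note: no inheritance from Emitter
    def __init__(self) -> None:
        self.items: List[Any] = []

    def emit(
        self,
        item: Any,
        callback: Optional[Callable[[Exception, str], None]] = None,
    ) -> None:
        self.items.append(item)

    def flush(self) -> None:
        pass


emitter = InMemoryEmitter()
emitter.emit({"entityUrn": "urn:li:dataset:example"})
# runtime_checkable isinstance() verifies method presence, not inheritance
print(isinstance(emitter, Emitter))  # True
```

This is what makes custom emitters (e.g., a buffering or test double emitter) drop-in replacements anywhere the SDK expects an Emitter.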
Key design aspects:
- Authentication abstraction -- Both emitters handle authentication transparently (bearer tokens for REST, SASL/OAuth for Kafka)
- Retry resilience -- The REST emitter retries transient failures with exponential backoff, including special handling for HTTP 429 (rate limiting) responses
- Configuration-driven behavior -- Both emitters accept configuration objects that control transport behavior
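The retry behavior can be pictured with a short sketch. This is an illustrative stand-in written for this document, not the SDK's actual retry code (the real emitter delegates to the HTTP client's retry machinery); the names and backoff factors are hypothetical:

```python
# Sketch: retry with exponential backoff, backing off harder on HTTP 429.
import time
from typing import Callable


class HttpError(Exception):
    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def emit_with_retries(
    send: Callable[[], None],
    max_attempts: int = 4,
    base_delay: float = 0.01,
) -> None:
    for attempt in range(max_attempts):
        try:
            send()
            return
        except HttpError as e:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the error
            delay = base_delay * (2 ** attempt)  # exponential backoff
            if e.status == 429:  # rate limited: wait longer before retrying
                delay *= 4
            time.sleep(delay)


# Usage: a sender that fails twice with 429, then succeeds.
calls = {"n": 0}

def flaky_send() -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise HttpError(429)

emit_with_retries(flaky_send)
print(calls["n"])  # 3
```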
Related
- Implemented by: Implementation:Datahub_project_Datahub_DataHubRestEmitter_Init
- Depends on: Datahub_project_Datahub_Python_SDK_Installation
- Used by: Datahub_project_Datahub_Metadata_Emission
- Heuristic: Heuristic:Datahub_project_Datahub_Emitter_Selection_Strategy