Principle: Datahub project Datahub Emitter Initialization
Metadata
| Field | Value |
|---|---|
| principle_name | Emitter Initialization |
| description | The process of creating an authenticated transport client for emitting metadata to DataHub via REST or Kafka. |
| type | principle |
| status | active |
| last_updated | 2026-02-10 |
| version | 1.0 |
Overview
Emitter Initialization is the process of creating an authenticated transport client for emitting metadata to DataHub via REST or Kafka. The emitter is the primary interface through which Python applications send metadata change proposals to the DataHub backend.
Description
Emitter initialization establishes a connection to the DataHub backend. The Emitter protocol (defined in datahub.emitter.generic_emitter) defines the contract that all emitter implementations must satisfy:
```python
from typing import Callable, Optional, Union

from typing_extensions import Protocol

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.com.linkedin.pegasus2avro.mxe import (
    MetadataChangeEvent,
    MetadataChangeProposal,
)


class Emitter(Protocol):
    def emit(
        self,
        item: Union[MetadataChangeEvent, MetadataChangeProposal, MetadataChangeProposalWrapper],
        callback: Optional[Callable[[Exception, str], None]] = None,
    ) -> None: ...

    def flush(self) -> None: ...
```
Two concrete implementations exist:
- DataHubRestEmitter -- Communicates with DataHub GMS over HTTP. Handles session management, retry logic with exponential backoff, SSL configuration, authentication via bearer tokens, and payload size management. Supports both RestLI and OpenAPI endpoints.
- DatahubKafkaEmitter -- Publishes metadata to Kafka topics using Avro serialization. Manages schema registry integration, topic routing for MCE and MCP message types, and optional OAuth authentication for managed Kafka services like AWS MSK.
Both implementations also implement the Closeable interface alongside the Emitter protocol, ensuring that underlying resources (HTTP sessions, Kafka producers) are released via close().
Usage
Use Emitter Initialization when starting a Python application that will emit metadata events to DataHub. The choice of emitter depends on infrastructure and requirements:
- DataHubRestEmitter is appropriate when:
- Direct, synchronous communication with GMS is needed
- Request-response confirmation of each emission is required
- The application needs to inspect server configuration (e.g., feature support)
- OpenAPI-based batching and trace-based write verification are desired
- DatahubKafkaEmitter is appropriate when:
- High-throughput asynchronous emission is needed
- Kafka infrastructure is already deployed
- Decoupled, fire-and-forget semantics are acceptable
- Schema registry-based serialization is preferred
Theoretical Basis
This principle follows the Strategy pattern. The Emitter Protocol defines the interface, with REST and Kafka as interchangeable strategies. Selection depends on infrastructure and latency requirements.
The Protocol-based approach (using typing_extensions.Protocol) enables structural subtyping, meaning any class that implements emit() and flush() with the correct signatures is considered an Emitter without explicit inheritance. This provides flexibility for custom emitter implementations.
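The structural-subtyping point can be demonstrated with a small self-contained sketch. InMemoryEmitter here is a hypothetical stand-in invented for this illustration (the SDK uses typing_extensions.Protocol for compatibility; the stdlib typing.Protocol used below behaves the same way):

```python
# Sketch: a class with matching emit()/flush() signatures satisfies the
# Emitter contract without inheriting from it (structural subtyping).
from typing import Any, Callable, List, Optional, Protocol, runtime_checkable


@runtime_checkable
class Emitter(Protocol):
    def emit(
        self,
        item: Any,
        callback: Optional[Callable[[Exception, str], None]] = None,
    ) -> None: ...

    def flush(self) -> None: ...


class InMemoryEmitter:  # note: no inheritance from Emitter
    def __init__(self) -> None:
        self.items: List[Any] = []

    def emit(
        self,
        item: Any,
        callback: Optional[Callable[[Exception, str], None]] = None,
    ) -> None:
        self.items.append(item)

    def flush(self) -> None:
        pass


emitter = InMemoryEmitter()
emitter.emit({"entityUrn": "urn:li:dataset:example"})
# runtime_checkable isinstance() verifies method presence, not inheritance
print(isinstance(emitter, Emitter))  # True
```

This is what makes custom emitters (e.g., a buffering or test double emitter) drop-in replacements anywhere the SDK expects an Emitter.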
Key design aspects:
- Authentication abstraction -- Both emitters handle authentication transparently (bearer tokens for REST, SASL/OAuth for Kafka)
- Retry resilience -- The REST emitter retries transient failures with exponential backoff, including special handling for HTTP 429 (rate limiting) responses
- Configuration-driven behavior -- Both emitters accept configuration objects that control transport behavior
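The retry behavior can be pictured with a short sketch. This is an illustrative stand-in written for this document, not the SDK's actual retry code (the real emitter delegates to the HTTP client's retry machinery); the names and backoff factors are hypothetical:

```python
# Sketch: retry with exponential backoff, backing off harder on HTTP 429.
import time
from typing import Callable


class HttpError(Exception):
    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def emit_with_retries(
    send: Callable[[], None],
    max_attempts: int = 4,
    base_delay: float = 0.01,
) -> None:
    for attempt in range(max_attempts):
        try:
            send()
            return
        except HttpError as e:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the error
            delay = base_delay * (2 ** attempt)  # exponential backoff
            if e.status == 429:  # rate limited: wait longer before retrying
                delay *= 4
            time.sleep(delay)


# Usage: a sender that fails twice with 429, then succeeds.
calls = {"n": 0}

def flaky_send() -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise HttpError(429)

emit_with_retries(flaky_send)
print(calls["n"])  # 3
```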
Related
- Implemented by: Implementation:Datahub_project_Datahub_DataHubRestEmitter_Init
- Depends on: Datahub_project_Datahub_Python_SDK_Installation
- Used by: Datahub_project_Datahub_Metadata_Emission
- Heuristic: Heuristic:Datahub_project_Datahub_Emitter_Selection_Strategy