Principle:Datahub project Datahub Metadata Emission
Metadata
| Field | Value |
|---|---|
| principle_name | Metadata Emission |
| description | The process of transmitting metadata change proposals to the DataHub backend for persistence and indexing. |
| type | principle |
| status | active |
| last_updated | 2026-02-10 |
| version | 1.0 |
Overview
Metadata Emission is the final step in the programmatic metadata pipeline. It sends Metadata Change Proposals (MCPs) to DataHub via the configured transport (REST HTTP or Kafka) for persistence in primary storage and indexing in search storage.
Description
Metadata emission is the process of transmitting constructed and wrapped MCPs to the DataHub backend. The Emitter protocol defines the emit() and flush() methods, and the two concrete implementations handle the transport-specific details:
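The two-method surface described above can be sketched as a structural protocol. This is an illustrative model of the Emitter contract, not the SDK's exact signatures; `InMemoryEmitter` is a hypothetical toy transport used only for demonstration.

```python
from typing import Any, Callable, List, Optional, Protocol, runtime_checkable


@runtime_checkable
class Emitter(Protocol):
    """Sketch of the Emitter protocol: emit() sends one MCP, flush() drains."""

    def emit(
        self,
        item: Any,
        callback: Optional[Callable[[Optional[Exception], str], None]] = None,
    ) -> None:
        """Send one metadata change proposal via the transport."""
        ...

    def flush(self) -> None:
        """Block until all pending emissions are delivered."""
        ...


class InMemoryEmitter:
    """Toy transport that records what it was asked to emit."""

    def __init__(self) -> None:
        self.items: List[Any] = []

    def emit(self, item, callback=None) -> None:
        self.items.append(item)
        if callback is not None:
            callback(None, "success")  # success convention from the callback pattern

    def flush(self) -> None:
        pass  # nothing is buffered in this toy implementation
```

Because the protocol is structural, any object exposing `emit()` and `flush()` satisfies it without explicit inheritance.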
REST Emission
The DataHubRestEmitter sends MCPs over HTTP to the GMS backend. It supports multiple emission modes that control the consistency-performance tradeoff:
- SYNC_WAIT -- Fully synchronous processing that updates both primary storage (SQL) and search storage (Elasticsearch) before returning. Provides the strongest consistency guarantee.
- SYNC_PRIMARY -- Synchronously updates the primary storage but asynchronously updates search storage. Balances consistency and performance.
- ASYNC -- Queues the metadata change for asynchronous processing and returns immediately. Best for high-throughput scenarios.
- ASYNC_WAIT -- Queues asynchronously but blocks until confirmation that the write has been fully persisted. Efficient due to backend parallelization while still providing strong consistency.
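The semantics of the four modes can be summarized in a small illustrative enum. This models the descriptions above; the SDK's actual EmitMode enum may differ in members and representation.

```python
from enum import Enum


class EmitMode(Enum):
    """Illustrative model of the four REST emit modes described above."""

    SYNC_WAIT = "primary and search storage updated before returning"
    SYNC_PRIMARY = "primary storage updated synchronously; search updated asynchronously"
    ASYNC = "queued for asynchronous processing; returns immediately"
    ASYNC_WAIT = "queued asynchronously; blocks until persistence is confirmed"


def returns_before_persistence(mode: EmitMode) -> bool:
    """Only ASYNC returns before the write is known to be persisted."""
    return mode is EmitMode.ASYNC
```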
The REST emitter supports two backend endpoints:
- RestLI -- The traditional DataHub REST API
- OpenAPI -- The newer OpenAPI-based endpoint with batching and trace support
Key behaviors:
- Payload size management with chunking for large batches (max ~15 MB per request)
- Automatic retry with exponential backoff and special handling for HTTP 429 rate limiting
- Unicode preservation in serialized payloads
- Optional trace-based write verification for ASYNC_WAIT mode
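The retry behavior can be illustrated with a generic backoff loop. This is a sketch, not the emitter's actual implementation; the set of retryable status codes and the delay parameters are assumptions.

```python
import time

# Assumed retryable statuses; 429 is the rate-limit case called out above.
RETRYABLE = {429, 502, 503, 504}


def post_with_retry(send, max_attempts: int = 4, base_delay: float = 0.05) -> int:
    """Call `send()` until it returns a non-retryable status code.

    `send` is any callable returning an HTTP-like status code.
    The delay doubles after each failed attempt (exponential backoff).
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status not in RETRYABLE:
            return status
        if attempt == max_attempts:
            raise RuntimeError(
                f"giving up after {max_attempts} attempts (last status {status})"
            )
        time.sleep(delay)
        delay *= 2  # back off before retrying; rate-limited (429) requests included
    raise AssertionError("unreachable")
```

A real emitter would also honor a `Retry-After` header on 429 responses; that detail is omitted here.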
Kafka Emission
The DatahubKafkaEmitter publishes MCPs to Kafka topics using Avro serialization. It is inherently asynchronous:
- Messages are serialized using Avro schemas registered with the Schema Registry
- MCE and MCP message types are routed to separate configurable Kafka topics
- Delivery callbacks provide asynchronous notification of success or failure
- The flush() method blocks until all pending messages are delivered
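The blocking flush() semantics can be modeled with a background delivery thread. This is a toy model of the asynchronous producer pattern, not the Kafka client itself: emit() only enqueues, delivery happens on another thread, and flush() blocks until the queue drains.

```python
import queue
import threading
from typing import Any, List


class ToyAsyncEmitter:
    """Models an asynchronous producer with callbacks and a blocking flush()."""

    def __init__(self) -> None:
        self._pending: queue.Queue = queue.Queue()
        self.delivered: List[Any] = []
        worker = threading.Thread(target=self._deliver_loop, daemon=True)
        worker.start()

    def _deliver_loop(self) -> None:
        while True:
            item, callback = self._pending.get()
            self.delivered.append(item)        # "deliver" the message
            if callback is not None:
                callback(None, "success")      # asynchronous delivery notification
            self._pending.task_done()

    def emit(self, item, callback=None) -> None:
        self._pending.put((item, callback))    # enqueue and return immediately

    def flush(self) -> None:
        self._pending.join()                   # block until every delivery completes
```

Without the final flush(), a short-lived producer process could exit with messages still sitting in the buffer, which is why flushing is called out as critical for Kafka.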
Callback Pattern
Both emitters support an optional callback function with signature Callable[[Exception, str], None]. On success, the callback receives (None, "success"); on failure, it receives (exception, error_message), and the REST emitter additionally re-raises the exception.
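The convention can be exercised with a small helper. Only the callback signature and the (None, "success") / (exception, message) convention come from the description above; `emit_with_callback` and `send` are illustrative names.

```python
from typing import Any, Callable, Optional

# Note: because the success case passes None, the first argument is
# effectively Optional[Exception].
EmitCallback = Callable[[Optional[Exception], str], None]


def emit_with_callback(
    send: Callable[[Any], None],
    item: Any,
    callback: Optional[EmitCallback] = None,
) -> None:
    """Invoke `send(item)` and report the outcome through the callback.

    Success -> callback(None, "success").
    Failure -> callback(exception, message), then re-raise (REST behavior).
    """
    try:
        send(item)
    except Exception as exc:
        if callback is not None:
            callback(exc, str(exc))
        raise
    if callback is not None:
        callback(None, "success")
```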
Usage
Use Metadata Emission when sending constructed and wrapped metadata to DataHub for storage and discovery. This is the final step in the programmatic metadata pipeline:
- Install the SDK with the appropriate transport extra
- Initialize an emitter (REST or Kafka)
- Construct metadata objects (URNs and aspects)
- Wrap them in Metadata Change Proposals
- Emit the MCPs via the emitter
- Flush to ensure all pending emissions are delivered (critical for Kafka)
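The steps above map onto the Python SDK's REST emitter roughly as follows. This is a hedged sketch: the server URL, platform, dataset name, and description are placeholders, and it assumes the SDK is installed with the REST extra (`pip install 'acryl-datahub[datahub-rest]'`).

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# 2. Initialize a REST emitter (the GMS URL is a placeholder).
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# 3. Construct metadata objects: a dataset URN and a properties aspect.
dataset_urn = make_dataset_urn(platform="hive", name="example.users", env="PROD")
properties = DatasetPropertiesClass(description="Example dataset emitted via the SDK")

# 4. Wrap them in a Metadata Change Proposal.
mcp = MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties)

# 5. Emit the MCP over HTTP to the backend.
emitter.emit(mcp)

# 6. Flush pending emissions (effectively immediate for REST; critical for Kafka).
emitter.flush()
```

Swapping in DatahubKafkaEmitter changes only the initialization step; the construct-wrap-emit-flush flow is the same.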
Choosing an Emit Mode (REST)
- Use SYNC_PRIMARY (default) for standard ingestion where immediate searchability is not critical
- Use SYNC_WAIT when metadata must be immediately searchable after emission
- Use ASYNC for high-throughput batch ingestion where eventual consistency is acceptable
- Use ASYNC_WAIT for efficient batch ingestion with persistence confirmation
Theoretical Basis
This principle follows the Producer pattern. The emitter acts as a message producer, delivering MCPs to the backend for consumption and processing.
- REST emission follows a synchronous request-response model (with optional async modes). Each HTTP request carries one or more MCPs and receives a response confirming receipt and processing.
- Kafka emission follows an asynchronous publish-subscribe model. MCPs are published to Kafka topics and consumed by the backend independently. The flush() method provides a synchronization point.
Key design aspects:
- Transport abstraction -- The Emitter protocol allows application code to be agnostic to the transport mechanism
- Consistency modes -- The EmitMode enum provides fine-grained control over the consistency-performance tradeoff
- Backpressure handling -- The REST emitter's retry logic and the Kafka emitter's buffering provide resilience against temporary backend unavailability
Related
- Implemented by: Implementation:Datahub_project_Datahub_Emitter_Emit
- Depends on: Datahub_project_Datahub_Emitter_Initialization, Datahub_project_Datahub_Metadata_Change_Proposal
- Heuristic: Heuristic:Datahub_project_Datahub_Emitter_Selection_Strategy