Principle:Datahub project Datahub Metadata Emission
Metadata
| Field | Value |
|---|---|
| principle_name | Metadata Emission |
| description | The process of transmitting metadata change proposals to the DataHub backend for persistence and indexing. |
| type | principle |
| status | active |
| last_updated | 2026-02-10 |
| version | 1.0 |
Overview
Metadata Emission is the final step in the programmatic metadata pipeline. It sends Metadata Change Proposals (MCPs) to DataHub via the configured transport (REST HTTP or Kafka) for persistence in primary storage and indexing in search storage.
Description
Metadata emission is the process of transmitting constructed and wrapped MCPs to the DataHub backend. The Emitter protocol defines the emit() and flush() methods, and the two concrete implementations handle the transport-specific details:
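The two-method surface described above can be sketched as a structural protocol. This is an illustrative model of the Emitter contract, not the SDK's exact signatures; `InMemoryEmitter` is a hypothetical toy transport used only for demonstration.

```python
from typing import Any, Callable, List, Optional, Protocol, runtime_checkable


@runtime_checkable
class Emitter(Protocol):
    """Sketch of the Emitter protocol: emit() sends one MCP, flush() drains."""

    def emit(
        self,
        item: Any,
        callback: Optional[Callable[[Optional[Exception], str], None]] = None,
    ) -> None:
        """Send one metadata change proposal via the transport."""
        ...

    def flush(self) -> None:
        """Block until all pending emissions are delivered."""
        ...


class InMemoryEmitter:
    """Toy transport that records what it was asked to emit."""

    def __init__(self) -> None:
        self.items: List[Any] = []

    def emit(self, item, callback=None) -> None:
        self.items.append(item)
        if callback is not None:
            callback(None, "success")  # success convention from the callback pattern

    def flush(self) -> None:
        pass  # nothing is buffered in this toy implementation
```

Because the protocol is structural, any object exposing `emit()` and `flush()` satisfies it without explicit inheritance.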
REST Emission
The DataHubRestEmitter sends MCPs over HTTP to the GMS backend. It supports multiple emission modes that control the consistency-performance tradeoff:
- SYNC_WAIT -- Fully synchronous processing that updates both primary storage (SQL) and search storage (Elasticsearch) before returning. Provides the strongest consistency guarantee.
- SYNC_PRIMARY -- Synchronously updates the primary storage but asynchronously updates search storage. Balances consistency and performance.
- ASYNC -- Queues the metadata change for asynchronous processing and returns immediately. Best for high-throughput scenarios.
- ASYNC_WAIT -- Queues asynchronously but blocks until confirmation that the write has been fully persisted. Efficient due to backend parallelization while still providing strong consistency.
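The semantics of the four modes can be summarized in a small illustrative enum. This models the descriptions above; the SDK's actual EmitMode enum may differ in members and representation.

```python
from enum import Enum


class EmitMode(Enum):
    """Illustrative model of the four REST emit modes described above."""

    SYNC_WAIT = "primary and search storage updated before returning"
    SYNC_PRIMARY = "primary storage updated synchronously; search updated asynchronously"
    ASYNC = "queued for asynchronous processing; returns immediately"
    ASYNC_WAIT = "queued asynchronously; blocks until persistence is confirmed"


def returns_before_persistence(mode: EmitMode) -> bool:
    """Only ASYNC returns before the write is known to be persisted."""
    return mode is EmitMode.ASYNC
```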
The REST emitter supports two backend endpoints:
- RestLI -- The traditional DataHub REST API
- OpenAPI -- The newer OpenAPI-based endpoint with batching and trace support
Key behaviors:
- Payload size management with chunking for large batches (max ~15 MB per request)
- Automatic retry with exponential backoff and special handling for HTTP 429 rate limiting
- Unicode preservation in serialized payloads
- Optional trace-based write verification for ASYNC_WAIT mode
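The retry behavior can be illustrated with a generic backoff loop. This is a sketch, not the emitter's actual implementation; the set of retryable status codes and the delay parameters are assumptions.

```python
import time

# Assumed retryable statuses; 429 is the rate-limit case called out above.
RETRYABLE = {429, 502, 503, 504}


def post_with_retry(send, max_attempts: int = 4, base_delay: float = 0.05) -> int:
    """Call `send()` until it returns a non-retryable status code.

    `send` is any callable returning an HTTP-like status code.
    The delay doubles after each failed attempt (exponential backoff).
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status not in RETRYABLE:
            return status
        if attempt == max_attempts:
            raise RuntimeError(
                f"giving up after {max_attempts} attempts (last status {status})"
            )
        time.sleep(delay)
        delay *= 2  # back off before retrying; rate-limited (429) requests included
    raise AssertionError("unreachable")
```

A real emitter would also honor a `Retry-After` header on 429 responses; that detail is omitted here.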
Kafka Emission
The DatahubKafkaEmitter publishes MCPs to Kafka topics using Avro serialization. It is inherently asynchronous:
- Messages are serialized using Avro schemas registered with the Schema Registry
- MCE and MCP message types are routed to separate configurable Kafka topics
- Delivery callbacks provide asynchronous notification of success or failure
- The flush() method blocks until all pending messages are delivered
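The blocking flush() semantics can be modeled with a background delivery thread. This is a toy model of the asynchronous producer pattern, not the Kafka client itself: emit() only enqueues, delivery happens on another thread, and flush() blocks until the queue drains.

```python
import queue
import threading
from typing import Any, List


class ToyAsyncEmitter:
    """Models an asynchronous producer with callbacks and a blocking flush()."""

    def __init__(self) -> None:
        self._pending: queue.Queue = queue.Queue()
        self.delivered: List[Any] = []
        worker = threading.Thread(target=self._deliver_loop, daemon=True)
        worker.start()

    def _deliver_loop(self) -> None:
        while True:
            item, callback = self._pending.get()
            self.delivered.append(item)        # "deliver" the message
            if callback is not None:
                callback(None, "success")      # asynchronous delivery notification
            self._pending.task_done()

    def emit(self, item, callback=None) -> None:
        self._pending.put((item, callback))    # enqueue and return immediately

    def flush(self) -> None:
        self._pending.join()                   # block until every delivery completes
```

Without the final flush(), a short-lived producer process could exit with messages still sitting in the buffer, which is why flushing is called out as critical for Kafka.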
Callback Pattern
Both emitters support an optional callback function with signature Callable[[Exception, str], None]. On success, the callback receives (None, "success"); on failure, it receives (exception, error_message), and the REST emitter additionally re-raises the exception.
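The convention can be exercised with a small helper. Only the callback signature and the (None, "success") / (exception, message) convention come from the description above; `emit_with_callback` and `send` are illustrative names.

```python
from typing import Any, Callable, Optional

# Note: because the success case passes None, the first argument is
# effectively Optional[Exception].
EmitCallback = Callable[[Optional[Exception], str], None]


def emit_with_callback(
    send: Callable[[Any], None],
    item: Any,
    callback: Optional[EmitCallback] = None,
) -> None:
    """Invoke `send(item)` and report the outcome through the callback.

    Success -> callback(None, "success").
    Failure -> callback(exception, message), then re-raise (REST behavior).
    """
    try:
        send(item)
    except Exception as exc:
        if callback is not None:
            callback(exc, str(exc))
        raise
    if callback is not None:
        callback(None, "success")
```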
Usage
Use Metadata Emission when sending constructed and wrapped metadata to DataHub for storage and discovery. This is the final step in the programmatic metadata pipeline:
- Install the SDK with the appropriate transport extra
- Initialize an emitter (REST or Kafka)
- Construct metadata objects (URNs and aspects)
- Wrap them in Metadata Change Proposals
- Emit the MCPs via the emitter
- Flush to ensure all pending emissions are delivered (critical for Kafka)
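The steps above map onto the Python SDK's REST emitter roughly as follows. This is a hedged sketch: the server URL, platform, dataset name, and description are placeholders, and it assumes the SDK is installed with the REST extra (`pip install 'acryl-datahub[datahub-rest]'`).

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# 2. Initialize a REST emitter (the GMS URL is a placeholder).
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# 3. Construct metadata objects: a dataset URN and a properties aspect.
dataset_urn = make_dataset_urn(platform="hive", name="example.users", env="PROD")
properties = DatasetPropertiesClass(description="Example dataset emitted via the SDK")

# 4. Wrap them in a Metadata Change Proposal.
mcp = MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties)

# 5. Emit the MCP over HTTP to the backend.
emitter.emit(mcp)

# 6. Flush pending emissions (effectively immediate for REST; critical for Kafka).
emitter.flush()
```

Swapping in DatahubKafkaEmitter changes only the initialization step; the construct-wrap-emit-flush flow is the same.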
Choosing an Emit Mode (REST)
- Use SYNC_PRIMARY (default) for standard ingestion where immediate searchability is not critical
- Use SYNC_WAIT when metadata must be immediately searchable after emission
- Use ASYNC for high-throughput batch ingestion where eventual consistency is acceptable
- Use ASYNC_WAIT for efficient batch ingestion with persistence confirmation
Theoretical Basis
This principle follows the Producer pattern. The emitter acts as a message producer, delivering MCPs to the backend for consumption and processing.
- REST emission follows a synchronous request-response model (with optional async modes). Each HTTP request carries one or more MCPs and receives a response confirming receipt and processing.
- Kafka emission follows an asynchronous publish-subscribe model. MCPs are published to Kafka topics and consumed by the backend independently. The flush() method provides a synchronization point.
Key design aspects:
- Transport abstraction -- The Emitter protocol allows application code to be agnostic to the transport mechanism
- Consistency modes -- The EmitMode enum provides fine-grained control over the consistency-performance tradeoff
- Backpressure handling -- The REST emitter's retry logic and the Kafka emitter's buffering provide resilience against temporary backend unavailability
Related
- Implemented by: Implementation:Datahub_project_Datahub_Emitter_Emit
- Depends on: Datahub_project_Datahub_Emitter_Initialization, Datahub_project_Datahub_Metadata_Change_Proposal
- Heuristic: Heuristic:Datahub_project_Datahub_Emitter_Selection_Strategy