Principle:Datahub project Datahub Metadata Emission

From Leeroopedia


Metadata

Field Value
principle_name Metadata Emission
description The process of transmitting metadata change proposals to the DataHub backend for persistence and indexing.
type principle
status active
last_updated 2026-02-10
version 1.0

Overview

Metadata Emission is the final step in the programmatic metadata pipeline. It sends Metadata Change Proposals (MCPs) to DataHub via the configured transport (REST HTTP or Kafka) for persistence in primary storage and indexing in search storage.

Description

Metadata emission is the process of transmitting constructed and wrapped MCPs to the DataHub backend. The Emitter protocol defines the emit() and flush() methods, and the two concrete implementations handle the transport-specific details:
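The shape of that protocol can be sketched with a toy in-memory implementation. This is a stdlib-only illustration of the `emit()`/`flush()` contract, not the SDK's actual classes; the real concrete implementations are `DataHubRestEmitter` and `DatahubKafkaEmitter`, and `InMemoryEmitter` here is purely hypothetical.

```python
from typing import Callable, List, Optional, Protocol, runtime_checkable

# Hypothetical stand-in for an MCP; the real SDK uses MetadataChangeProposalWrapper.
MCP = dict


@runtime_checkable
class Emitter(Protocol):
    """Transport-agnostic emitter interface: emit one MCP, flush pending ones."""

    def emit(
        self,
        mcp: MCP,
        callback: Optional[Callable[[Exception, str], None]] = None,
    ) -> None: ...

    def flush(self) -> None: ...


class InMemoryEmitter:
    """Toy implementation that records MCPs instead of sending them."""

    def __init__(self) -> None:
        self.pending: List[MCP] = []
        self.sent: List[MCP] = []

    def emit(self, mcp: MCP, callback=None) -> None:
        self.pending.append(mcp)
        if callback:
            callback(None, "success")

    def flush(self) -> None:
        self.sent.extend(self.pending)
        self.pending.clear()
```

Because application code only depends on the protocol, swapping REST for Kafka transport does not require changes at the call sites.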

REST Emission

The DataHubRestEmitter sends MCPs over HTTP to the GMS backend. It supports multiple emission modes that control the consistency-performance tradeoff:

  • SYNC_WAIT -- Fully synchronous processing that updates both primary storage (SQL) and search storage (Elasticsearch) before returning. Provides the strongest consistency guarantee.
  • SYNC_PRIMARY -- Synchronously updates the primary storage but asynchronously updates search storage. Balances consistency and performance.
  • ASYNC -- Queues the metadata change for asynchronous processing and returns immediately. Best for high-throughput scenarios.
  • ASYNC_WAIT -- Queues asynchronously but blocks until confirmation that the write has been fully persisted. Efficient due to backend parallelization while still providing strong consistency.
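The four modes above can be modeled as an enum. The mode names match the list; the enum definition and the `blocks_until_persisted` helper are an illustrative sketch, not the SDK's actual `EmitMode` import path or API.

```python
from enum import Enum


class EmitMode(Enum):
    # Synchronous write to both SQL and Elasticsearch before returning.
    SYNC_WAIT = "SYNC_WAIT"
    # Synchronous write to primary storage; search index updated asynchronously.
    SYNC_PRIMARY = "SYNC_PRIMARY"
    # Fire-and-forget: queue the change and return immediately.
    ASYNC = "ASYNC"
    # Queue asynchronously, then block until persistence is confirmed.
    ASYNC_WAIT = "ASYNC_WAIT"


def blocks_until_persisted(mode: EmitMode) -> bool:
    """Modes that guarantee the write is durable in primary storage on return."""
    return mode in {EmitMode.SYNC_WAIT, EmitMode.SYNC_PRIMARY, EmitMode.ASYNC_WAIT}
```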

The REST emitter supports two backend endpoints:

  • RestLI -- The traditional DataHub REST API
  • OpenAPI -- The newer OpenAPI-based endpoint with batching and trace support

Key behaviors:

  • Payload size management with chunking for large batches (max ~15 MB per request)
  • Automatic retry with exponential backoff and special handling for HTTP 429 rate limiting
  • Unicode preservation in serialized payloads
  • Optional trace-based write verification for ASYNC_WAIT mode
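Two of these behaviors, retry with exponential backoff (with harsher backoff on rate limiting) and size-based batch chunking, can be sketched as follows. This is a stdlib-only model of the pattern; function names, delays, and the `RateLimited` exception are illustrative, not the REST emitter's real internals.

```python
import time
from typing import Callable, List


class RateLimited(Exception):
    """Stand-in for an HTTP 429 response."""


def emit_with_retry(
    send: Callable[[], str], max_attempts: int = 4, base_delay: float = 0.01
) -> str:
    """Retry `send` with exponential backoff; rate limiting backs off harder."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            delay = base_delay * (2**attempt) * 2  # extra wait after a 429
        except ConnectionError:
            delay = base_delay * (2**attempt)
        if attempt == max_attempts - 1:
            raise RuntimeError("emission failed after retries")
        time.sleep(delay)


def chunk_batches(payloads: List[bytes], max_bytes: int) -> List[List[bytes]]:
    """Split a batch so no single request exceeds max_bytes (cf. the ~15 MB cap)."""
    batches, current, size = [], [], 0
    for p in payloads:
        if current and size + len(p) > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(p)
        size += len(p)
    if current:
        batches.append(current)
    return batches
```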

Kafka Emission

The DatahubKafkaEmitter publishes MCPs to Kafka topics using Avro serialization. It is inherently asynchronous:

  • Messages are serialized using Avro schemas registered with the Schema Registry
  • MCE and MCP message types are routed to separate configurable Kafka topics
  • Delivery callbacks provide asynchronous notification of success or failure
  • The flush() method blocks until all pending messages are delivered
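The asynchronous-delivery-plus-blocking-flush shape can be mimicked with a thread pool. The `ToyKafkaEmitter` below is a hypothetical stand-in: the real `DatahubKafkaEmitter` produces Avro-serialized records to Kafka topics, whereas this sketch only models the emit/callback/flush lifecycle.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Optional


class ToyKafkaEmitter:
    """Mimics the Kafka emitter's shape: emit() is async, flush() blocks."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._pending: List[Future] = []
        self.delivered: List[str] = []

    def emit(
        self,
        record: str,
        callback: Optional[Callable[[Exception, str], None]] = None,
    ) -> None:
        def deliver() -> None:
            # In the real emitter this is an Avro-serialized produce() call.
            self.delivered.append(record)
            if callback:
                callback(None, "success")

        self._pending.append(self._pool.submit(deliver))

    def flush(self) -> None:
        # Block until every pending delivery has completed.
        for f in self._pending:
            f.result()
        self._pending.clear()
```

Skipping `flush()` before process exit risks dropping buffered messages, which is why the flush step is emphasized for Kafka.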

Callback Pattern

Both emitters accept an optional callback with the signature Callable[[Exception, str], None]. On success, the callback receives (None, "success"); on failure, it receives (exception, error_message). The REST emitter additionally re-raises the exception after invoking the callback.
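A minimal sketch of that callback contract, including the REST-style re-raise, might look like this. The `emit_once` helper and its `reraise` flag are illustrative, not SDK API.

```python
from typing import Callable, Optional

# Callback contract: exc is None on success, the raised exception on failure.
EmitCallback = Callable[[Exception, str], None]


def emit_once(
    send: Callable[[], None],
    callback: Optional[EmitCallback] = None,
    reraise: bool = True,
) -> None:
    """Invoke the callback with (None, "success") or (exc, message).

    reraise=True models the REST emitter, which re-raises after the callback;
    reraise=False models the fire-and-forget Kafka path.
    """
    try:
        send()
    except Exception as exc:
        if callback:
            callback(exc, str(exc))
        if reraise:
            raise
    else:
        if callback:
            callback(None, "success")
```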

Usage

Use Metadata Emission when sending constructed and wrapped metadata to DataHub for storage and discovery. This is the final step in the programmatic metadata pipeline:

  1. Install the SDK with the appropriate transport extra
  2. Initialize an emitter (REST or Kafka)
  3. Construct metadata objects (URNs and aspects)
  4. Wrap them in Metadata Change Proposals
  5. Emit the MCPs via the emitter
  6. Flush to ensure all pending emissions are delivered (critical for Kafka)
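Steps 2 through 6 can be traced end to end with a toy emitter. In real code these steps use `DatahubRestEmitter` or `DatahubKafkaEmitter`, `MetadataChangeProposalWrapper`, and aspect classes such as `DatasetPropertiesClass`; the `ToyEmitter` class, the aspect dict shape, and the envelope fields below are illustrative stand-ins.

```python
class ToyEmitter:
    """Stand-in for a REST or Kafka emitter (step 2)."""

    def __init__(self) -> None:
        self.buffer, self.delivered = [], []

    def emit(self, mcp: dict) -> None:  # step 5
        self.buffer.append(mcp)

    def flush(self) -> None:  # step 6
        self.delivered.extend(self.buffer)
        self.buffer.clear()


emitter = ToyEmitter()

# Step 3: construct a URN and an aspect (shapes are illustrative).
urn = "urn:li:dataset:(urn:li:dataPlatform:hive,logging.events,PROD)"
aspect = {"description": "Example table", "customProperties": {"team": "data"}}

# Step 4: wrap URN + aspect in an MCP-like envelope.
mcp = {"entityUrn": urn, "aspectName": "datasetProperties", "aspect": aspect}

# Steps 5-6: emit, then flush so nothing stays buffered.
emitter.emit(mcp)
emitter.flush()
```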

Choosing an Emit Mode (REST)

  • Use SYNC_PRIMARY (default) for standard ingestion where immediate searchability is not critical
  • Use SYNC_WAIT when metadata must be immediately searchable after emission
  • Use ASYNC for high-throughput batch ingestion where eventual consistency is acceptable
  • Use ASYNC_WAIT for efficient batch ingestion with persistence confirmation
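The guidance above reduces to a small decision function. The mode names come from this page; the decision logic is one illustrative reading of the guidance, not SDK behavior.

```python
def choose_emit_mode(
    need_search_visibility: bool,
    need_persistence_confirmation: bool,
    high_throughput: bool,
) -> str:
    """Pick a REST emit mode from the documented guidance."""
    if need_search_visibility:
        return "SYNC_WAIT"       # must be searchable immediately after emit
    if high_throughput:
        if need_persistence_confirmation:
            return "ASYNC_WAIT"  # batch ingestion with durability confirmation
        return "ASYNC"           # eventual consistency is acceptable
    return "SYNC_PRIMARY"        # the default, balanced choice
```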

Theoretical Basis

This principle follows the Producer pattern. The emitter acts as a message producer, delivering MCPs to the backend for consumption and processing.

  • REST emission follows a synchronous request-response model (with optional async modes). Each HTTP request carries one or more MCPs and receives a response confirming receipt and processing.
  • Kafka emission follows an asynchronous publish-subscribe model. MCPs are published to Kafka topics and consumed by the backend independently. The flush() method provides a synchronization point.

Key design aspects:

  • Transport abstraction -- The Emitter Protocol allows application code to be agnostic to the transport mechanism
  • Consistency modes -- The EmitMode enum provides fine-grained control over the consistency-performance tradeoff
  • Backpressure handling -- The REST emitter's retry logic and the Kafka emitter's buffering provide resilience against temporary backend unavailability

Related

Implementation:Datahub_project_Datahub_Emitter_Emit

Knowledge Sources

Domains

Data_Integration, Metadata_Management
