Heuristic:Datahub project Datahub Emitter Selection Strategy

Knowledge Sources	as-a-library.md rest_emitter.py
Domains	Architecture, Metadata_Emission, Decision_Framework
Last Updated	2026-02-10 00:00 GMT

Overview

Decision framework for choosing between REST and Kafka emitters based on delivery guarantees, throughput needs, and coupling tolerance.

Description

DataHub provides two primary emitter implementations for programmatic metadata emission: the REST emitter (`DataHubRestEmitter`) and the Kafka emitter (`DatahubKafkaEmitter`). They differ in delivery semantics, coupling to the GMS server, throughput, and configuration complexity. Choosing the wrong emitter can lead to lost metadata, unnecessary coupling, or poor performance.

Usage

Use this heuristic when designing a new metadata integration or choosing an emitter for programmatic SDK use. It is especially relevant when deciding between the `[datahub-rest]` and `[datahub-kafka]` extras of the `acryl-datahub` package during installation.

The Insight (Rule of Thumb)

Use REST emitter when:
- You need acknowledgement that metadata was persisted (read-after-write)
- Simplicity is more important than throughput
- You want a blocking interface with immediate error feedback
- GMS server uptime is reliable

Use Kafka emitter when:
- You need to decouple the producer from GMS server availability
- Throughput matters more than acknowledgement
- You can tolerate eventual consistency and prefer a non-blocking interface
- You already have Kafka infrastructure

Trade-off: REST gives delivery confirmation but couples to GMS uptime; Kafka decouples but loses immediate feedback and requires schema registry configuration.
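
The rule of thumb above can be sketched as a small decision helper. This is only an illustration of the checklist, not part of the DataHub SDK: `EmissionRequirements`, its field names, and `choose_emitter` are hypothetical, and the order in which the criteria are checked is an assumption about how to break ties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmissionRequirements:
    """Hypothetical inputs to the decision; field names are illustrative only."""

    needs_write_confirmation: bool  # must the caller know the write was accepted?
    gms_reliably_available: bool    # is GMS uptime dependable from the producer?
    has_kafka_infrastructure: bool  # are Kafka and a schema registry already run?
    throughput_critical: bool       # does throughput outweigh acknowledgement?


def choose_emitter(req: EmissionRequirements) -> str:
    """Return "rest" or "kafka" following the rule of thumb (assumed priority)."""
    # Delivery confirmation is the REST emitter's distinguishing property,
    # so it is checked first.
    if req.needs_write_confirmation:
        return "rest"
    # Without existing Kafka infrastructure, the simpler REST setup wins.
    if not req.has_kafka_infrastructure:
        return "rest"
    # Kafka pays off when throughput dominates or GMS may be unavailable.
    if req.throughput_critical or not req.gms_reliably_available:
        return "kafka"
    return "rest"
```

For example, a batch pipeline with Kafka already in place, no need for confirmation, and an occasionally unavailable GMS would resolve to `"kafka"`, while a CI/CD check that must verify the write resolves to `"rest"`.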

Reasoning

The REST emitter sends metadata change proposals (MCPs) directly to the GMS REST API and waits for a response, providing synchronous confirmation that the metadata was accepted. This is ideal for interactive tools, CI/CD pipelines, and scenarios where you need to verify that the write succeeded before proceeding.

The Kafka emitter publishes MCPs to a Kafka topic, which GMS consumes asynchronously. This is ideal for high-throughput batch pipelines and environments where GMS may be temporarily unavailable. However, it requires Avro serialization with a schema registry, and changing the serializer breaks compatibility.
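
The difference in coupling described above can be illustrated with a toy model that uses only the Python standard library. It does not call DataHub, Kafka, or GMS: `ToyServer`, `BlockingEmitter`, `QueuedEmitter`, and `drain` are hypothetical stand-ins, and an in-memory `queue.Queue` plays the role of the Kafka topic.

```python
import queue


class ToyServer:
    """Stand-in for GMS: stores records only while it is up."""

    def __init__(self):
        self.up = True
        self.stored = []

    def write(self, record):
        if not self.up:
            raise ConnectionError("server unavailable")
        self.stored.append(record)


class BlockingEmitter:
    """REST-like: emit() returns only after the server accepts the record."""

    def __init__(self, server):
        self.server = server

    def emit(self, record):
        # Raises immediately if the server is down, so the caller gets feedback.
        self.server.write(record)


class QueuedEmitter:
    """Kafka-like: emit() appends to a log; the server consumes it later."""

    def __init__(self, log):
        self.log = log

    def emit(self, record):
        # Succeeds regardless of server state; no confirmation of persistence.
        self.log.put(record)


def drain(log, server):
    """Consumer side: move every queued record into the server."""
    while not log.empty():
        server.write(log.get())
```

With `server.up = False`, `BlockingEmitter(server).emit(...)` raises `ConnectionError`, while `QueuedEmitter(log).emit(...)` returns normally and the record reaches `server.stored` only after the server is back and `drain(log, server)` runs. That is the trade-off in miniature: immediate feedback versus independence from server availability.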

The `async_flag` parameter on the REST emitter is deprecated in favor of the `emit_mode` parameter, which offers finer-grained control: `SYNC_PRIMARY`, `SYNC_WAIT`, `ASYNC`, `ASYNC_WAIT`.
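
A minimal sketch of how a wrapper might translate the deprecated boolean into one of the four modes follows. It is not the DataHub implementation: `EmitMode` here is a local enum that only mirrors the mode names listed above, `resolve_emit_mode` is a hypothetical helper, and both the mapping (`True` to `ASYNC`, `False` to `SYNC_PRIMARY`) and the `SYNC_PRIMARY` default are assumptions made for illustration.

```python
import warnings
from enum import Enum
from typing import Optional


class EmitMode(Enum):
    """Local mirror of the four mode names; not imported from DataHub."""

    SYNC_PRIMARY = "SYNC_PRIMARY"
    SYNC_WAIT = "SYNC_WAIT"
    ASYNC = "ASYNC"
    ASYNC_WAIT = "ASYNC_WAIT"


def resolve_emit_mode(
    async_flag: Optional[bool] = None,
    emit_mode: Optional[EmitMode] = None,
    default: EmitMode = EmitMode.SYNC_PRIMARY,  # assumed default
) -> EmitMode:
    """Pick an emit mode, accepting the legacy boolean for compatibility."""
    if async_flag is not None and emit_mode is not None:
        raise ValueError("Pass either async_flag or emit_mode, not both")
    if async_flag is not None:
        warnings.warn(
            "async_flag is deprecated; use emit_mode instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumed mapping: the boolean can only express two of the four modes.
        return EmitMode.ASYNC if async_flag else EmitMode.SYNC_PRIMARY
    return emit_mode if emit_mode is not None else default
```

The sketch also shows why the enum is the finer-grained option: a boolean can select only two behaviours, whereas `SYNC_WAIT` and `ASYNC_WAIT` are reachable only through `emit_mode`.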

Code Evidence

Emit mode options from `env_vars.py`:

```python
# DATAHUB_EMIT_MODE options:
# SYNC_PRIMARY - synchronous primary write
# SYNC_WAIT - synchronous with wait
# ASYNC - asynchronous fire-and-forget
# ASYNC_WAIT - asynchronous with eventual confirmation
```

Deprecation of async_flag from `rest_emitter.py:648`:

```python
@deprecated("Use emit_mode instead of async_flag")
```
