Heuristic:Datahub project Datahub Emitter Selection Strategy
| Knowledge Sources | |
|---|---|
| Domains | Architecture, Metadata_Emission, Decision_Framework |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Decision framework for choosing between REST and Kafka emitters based on delivery guarantees, throughput needs, and coupling tolerance.
Description
DataHub provides two primary emitter implementations for programmatic metadata emission: the REST emitter (`DataHubRestEmitter`) and the Kafka emitter (`DataHubKafkaEmitter`). Each has distinct trade-offs in terms of delivery semantics, coupling to the GMS server, throughput characteristics, and configuration complexity. Choosing the wrong emitter can lead to lost metadata, unnecessary coupling, or poor performance.
Usage
Use this heuristic when designing a new metadata integration or choosing an emitter for programmatic SDK use. It is especially relevant when deciding between the `[rest]` and `[kafka]` extras during installation.
The Insight (Rule of Thumb)
- Use REST emitter when:
  - You need acknowledgement that metadata was persisted (read-after-write)
  - Simplicity is more important than throughput
  - You want a blocking interface with immediate error feedback
  - GMS server uptime is reliable
- Use Kafka emitter when:
  - You need to decouple the producer from GMS server availability
  - Throughput matters more than acknowledgement
  - You can tolerate eventual consistency (non-blocking)
  - You already have Kafka infrastructure
- Trade-off: REST gives delivery confirmation but couples to GMS uptime; Kafka decouples but loses immediate feedback and requires schema registry configuration.
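The rule of thumb above can be written down as a small decision helper. This is only a sketch: `IntegrationNeeds` and `choose_emitter` are hypothetical names that are not part of the DataHub SDK, and the precedence between the criteria (confirmation first, then Kafka only when the infrastructure already exists) is an assumption rather than something the heuristic states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegrationNeeds:
    """Hypothetical summary of what an integration requires (not an SDK type)."""

    needs_write_confirmation: bool  # must know the write was accepted before continuing
    gms_reliably_available: bool    # GMS uptime can be depended on
    high_throughput: bool           # volume matters more than per-write acknowledgement
    has_kafka_infrastructure: bool  # Kafka and a schema registry are already operated


def choose_emitter(needs: IntegrationNeeds) -> str:
    """Return "rest" or "kafka" following the rule of thumb.

    Assumed precedence: a hard need for confirmation wins, because the
    heuristic attributes delivery confirmation to the REST emitter.
    """
    if needs.needs_write_confirmation:
        return "rest"
    # Kafka is only suggested when the infrastructure exists AND there is a
    # reason to decouple (throughput pressure or unreliable GMS uptime).
    wants_decoupling = needs.high_throughput or not needs.gms_reliably_available
    if needs.has_kafka_infrastructure and wants_decoupling:
        return "kafka"
    # Default to the simpler option.
    return "rest"
```

For example, a bulk backfill with existing Kafka infrastructure and no need for per-write confirmation resolves to `"kafka"`, while a CI job that must verify each write resolves to `"rest"`.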
Reasoning
The REST emitter sends Metadata Change Proposals (MCPs) directly to the GMS REST API and waits for a response, providing synchronous confirmation that the metadata was accepted. This suits interactive tools, CI/CD pipelines, and any scenario where you must verify that the write succeeded before proceeding.
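A minimal sketch of the blocking pattern described above, where a failed write stops the pipeline immediately. The helper names `emit_or_fail` and `make_rest_emitter` are hypothetical. The import uses the spelling `DatahubRestEmitter` and the `gms_server` argument as found in the SDK's published examples; both should be checked against the installed version. The import is deferred into the factory so the helper can be exercised with any object that exposes `emit(item)`.

```python
from typing import Any, Protocol, Sequence


class SupportsEmit(Protocol):
    """Anything with a blocking emit(item) method, such as a REST emitter."""

    def emit(self, item: Any) -> None: ...


def emit_or_fail(emitter: SupportsEmit, items: Sequence[Any]) -> int:
    """Emit items one at a time and stop at the first failure.

    Because each call blocks until it returns or raises, an exception here
    means the item at index `sent` was not confirmed as accepted.
    """
    sent = 0
    for item in items:
        try:
            emitter.emit(item)
        except Exception as exc:
            raise RuntimeError(
                f"emit failed after {sent} of {len(items)} items"
            ) from exc
        sent += 1
    return sent


def make_rest_emitter(gms_server: str):
    """Build a REST emitter (assumes the SDK's REST extra is installed)."""
    # Deferred import: this module still loads where the SDK is absent.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    return DatahubRestEmitter(gms_server=gms_server)
```

A caller would pass `make_rest_emitter("http://localhost:8080")` (a placeholder URL) together with a list of MCPs, and treat the raised `RuntimeError` as the signal to abort.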
The Kafka emitter publishes MCPs to a Kafka topic, which GMS consumes asynchronously. This suits high-throughput batch pipelines and environments where GMS may be temporarily unavailable. However, it requires Avro serialization backed by a schema registry, and switching to a different serializer breaks compatibility with the consumer side.
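Since the Kafka path returns before delivery is known, any feedback has to come from a delivery callback plus an explicit flush. The sketch below assumes the callback receives `(err, msg)` with a falsy `err` on success, and that the emitter is configured through `KafkaEmitterConfig.parse_obj` with `bootstrap` and `schema_registry_url` under `connection`, as in the SDK's published examples. The class is imported under the spelling `DatahubKafkaEmitter`; verify this and the config keys against the installed version. `DeliveryTracker` and `emit_batch_via_kafka` are hypothetical names.

```python
from typing import Any, List, Optional, Sequence


class DeliveryTracker:
    """Callable that records outcomes reported by the delivery callback."""

    def __init__(self) -> None:
        self.delivered = 0
        self.errors: List[Exception] = []

    def __call__(self, err: Optional[Exception], msg: Any) -> None:
        # Assumed convention: err is falsy when the message was delivered.
        if err:
            self.errors.append(err)
        else:
            self.delivered += 1


def emit_batch_via_kafka(
    mcps: Sequence[Any], bootstrap: str, schema_registry_url: str
) -> DeliveryTracker:
    """Queue every MCP, then flush and return the collected outcomes."""
    # Deferred import: requires the SDK's Kafka extra at call time only.
    from datahub.emitter.kafka_emitter import (
        DatahubKafkaEmitter,
        KafkaEmitterConfig,
    )

    config = KafkaEmitterConfig.parse_obj(
        {
            "connection": {
                "bootstrap": bootstrap,
                "schema_registry_url": schema_registry_url,
            }
        }
    )
    emitter = DatahubKafkaEmitter(config)
    tracker = DeliveryTracker()
    for mcp in mcps:
        # Returns once the message is queued, not once GMS has consumed it.
        emitter.emit(mcp, callback=tracker)
    # Wait for queued messages to be delivered to Kafka or to fail.
    emitter.flush()
    return tracker
```

Note that a clean `tracker.errors` list only says the messages reached Kafka; it does not confirm that GMS has processed them, which is the eventual-consistency trade-off described above.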
The `async_flag` parameter on the REST emitter is deprecated in favor of the `emit_mode` parameter, which offers finer-grained control: `SYNC_PRIMARY`, `SYNC_WAIT`, `ASYNC`, `ASYNC_WAIT`.
Code Evidence
Emit mode options from `env_vars.py`:
# DATAHUB_EMIT_MODE options:
# SYNC_PRIMARY - synchronous primary write
# SYNC_WAIT - synchronous with wait
# ASYNC - asynchronous fire-and-forget
# ASYNC_WAIT - asynchronous with eventual confirmation
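As an illustration of how a caller might validate this setting before handing it to an emitter, the sketch below reads `DATAHUB_EMIT_MODE` and checks it against the four names listed above. `EmitModeName` is a local stand-in, not the SDK's own enum, and `resolve_emit_mode` is a hypothetical helper. The fallback to `SYNC_PRIMARY` when the variable is unset and the case-insensitive matching are both assumptions.

```python
import os
from enum import Enum
from typing import Mapping, Optional


class EmitModeName(Enum):
    """Local stand-in listing the four documented mode names."""

    SYNC_PRIMARY = "SYNC_PRIMARY"
    SYNC_WAIT = "SYNC_WAIT"
    ASYNC = "ASYNC"
    ASYNC_WAIT = "ASYNC_WAIT"


def resolve_emit_mode(
    env: Optional[Mapping[str, str]] = None,
    default: EmitModeName = EmitModeName.SYNC_PRIMARY,  # assumed fallback
) -> EmitModeName:
    """Return the mode named by DATAHUB_EMIT_MODE, or `default` if unset."""
    source = os.environ if env is None else env
    raw = source.get("DATAHUB_EMIT_MODE", "").strip().upper()
    if not raw:
        return default
    try:
        return EmitModeName[raw]
    except KeyError:
        valid = ", ".join(mode.name for mode in EmitModeName)
        raise ValueError(
            f"DATAHUB_EMIT_MODE={raw!r} is not one of: {valid}"
        ) from None
```

Passing an explicit mapping instead of relying on `os.environ` keeps the helper easy to test and makes a misspelled mode fail loudly at startup rather than at the first emit.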
Deprecation of `async_flag` from `rest_emitter.py:648`:
@deprecated("Use emit_mode instead of async_flag")