
Principle:Datahub project Datahub Protobuf Metadata Emission

From Leeroopedia


Field | Value
Principle Name | Protobuf_Metadata_Emission
Category | Metadata Delivery
Workflow | Protobuf_Schema_Ingestion
Repository | https://github.com/datahub-project/datahub
Implemented By | Implementation:Datahub_project_Datahub_Proto2DataHub_RestEmitter_Emit
Last Updated | 2026-02-09 17:00 GMT

Overview

Description

Protobuf Metadata Emission is the principle governing the final stage of the protobuf ingestion pipeline: delivering schema-derived Metadata Change Proposals (MCPs) to DataHub for persistence. After protobuf schemas have been compiled, annotated, and converted into structured metadata aspects, the emission stage is responsible for reliably transmitting each MCP to DataHub's backend through a configured transport mechanism.

This principle establishes that emission must be reliable (every MCP is delivered or failures are reported), transport-agnostic (the same conversion output can be sent via REST, Kafka, or file), and observable (the tool tracks and reports the number of events emitted and files processed).

Usage

The emission principle is applied at the final stage of the Proto2DataHub pipeline. After the ProtobufDataset builder produces a stream of MetadataChangeProposalWrapper collections, the emitter iterates over each MCP and sends it to DataHub.

Supported transport modes include:

  • REST (default): Sends MCPs to the DataHub GMS REST API via RestEmitter.
  • File: Writes MCPs to a local file via FileEmitter for later bulk ingestion.
  • Kafka: Planned but not yet supported; would publish MCPs directly to the metadata change event stream.

Theoretical Basis

Batch Emission Pattern

The emission stage follows a batch streaming pattern where MCPs are produced lazily from the schema conversion stage and emitted eagerly to the DataHub sink. This pattern has several characteristics:

  1. Lazy production: The ProtobufDataset.getAllMetadataChangeProposals() method returns a Stream<Collection<MetadataChangeProposalWrapper>>. The stream is evaluated lazily, meaning MCPs for a given proto file are only generated when the emitter is ready to consume them.
  2. Eager emission: Each MCP is emitted immediately upon generation via emitter.emit(mcpw, null).get(). The .get() call blocks until the emission completes, ensuring synchronous delivery and immediate error detection.
  3. Per-file granularity: The outer loop iterates over input files, and for each file, the inner loop iterates over all MCPs produced by that file's ProtobufDataset. This means a failure on one file does not prevent processing of subsequent files.

Streaming MCPs from Schema Conversion to DataHub Sink

The data flow from conversion to emission follows a two-level streaming model:

Level 1 -- File stream: The Proto2DataHub.main method iterates over input files (either a single file or a directory walk). Each file produces a ProtobufDataset instance.

Level 2 -- MCP stream: Each ProtobufDataset produces two collections of MCPs via getAllMetadataChangeProposals():

  • Visitor MCPs: Produced by the TagVisitor and other model-level visitors that create standalone entities (e.g., Tag entities).
  • Dataset MCPs: Produced by the DatasetVisitor and the ProtobufDataset itself, including SchemaMetadata, Status, SubTypes, DatasetProperties, Ownership, GlobalTags, Domains, GlossaryTerms, InstitutionalMemory, and Deprecation.

Both levels are flattened via flatMap(Collection::stream) into a single stream of individual MCPs for emission.
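The double flattening can be shown with a small stdlib-only sketch. The dataset names and aspect strings below are invented for the demo; only the `flatMap(Collection::stream)` shape mirrors the described flow.

```java
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlattenSketch {
    // Each dataset contributes visitor MCPs and dataset MCPs as separate collections.
    static Stream<Collection<String>> perDatasetMcps(String dataset) {
        List<String> visitorMcps = List.of(dataset + ":Tag");
        List<String> datasetMcps = List.of(dataset + ":SchemaMetadata", dataset + ":Status");
        return Stream.of(visitorMcps, datasetMcps);
    }

    public static List<String> flatten(List<String> datasets) {
        return datasets.stream()
                .flatMap(FlattenSketch::perDatasetMcps)  // level 1: per-file collections
                .flatMap(Collection::stream)             // level 2: individual MCPs
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two datasets, three MCPs each -> six individual MCPs.
        System.out.println(flatten(List.of("orders", "users")).size()); // 6
    }
}
```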

Status Tracking and Error Handling

The emission principle mandates comprehensive status tracking:

  • Event counter: An AtomicInteger totalEvents tracks the total number of successfully emitted MCPs across all files.
  • File counter: An AtomicInteger totalFiles tracks the number of files processed.
  • Exit code: An AtomicInteger exitCode is set to 1 if any emission fails, enabling CI/CD pipelines to detect partial failures.
  • Error isolation: Exceptions during the processing of one file are caught and logged, allowing the pipeline to continue with remaining files. The special case of "Cannot autodetect protobuf Message" is treated as a warning rather than an error.

The final status report indicates whether all emissions succeeded or whether partial failures occurred, along with the total counts.
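The counter and error-isolation shape can be sketched as follows. The `processFile` callback and the failure messages are illustrative; only the `AtomicInteger` counters, the warning special case, and the exit-code behavior mirror what is described above.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class StatusTrackingSketch {
    public static int run(List<String> files, Consumer<String> processFile,
                          AtomicInteger totalFiles, AtomicInteger totalEvents) {
        AtomicInteger exitCode = new AtomicInteger(0);
        for (String file : files) {
            try {
                processFile.accept(file);     // emit all MCPs for this file
                totalFiles.incrementAndGet();
            } catch (Exception e) {
                // Failure on one file is logged and does not stop the pipeline.
                if (e.getMessage() != null
                        && e.getMessage().contains("Cannot autodetect protobuf Message")) {
                    System.err.println("WARN  " + file + ": " + e.getMessage());
                } else {
                    System.err.println("ERROR " + file + ": " + e.getMessage());
                    exitCode.set(1);          // lets CI/CD detect partial failure
                }
            }
        }
        return exitCode.get();
    }

    public static void main(String[] args) {
        AtomicInteger totalFiles = new AtomicInteger();
        AtomicInteger totalEvents = new AtomicInteger();
        int code = run(List.of("good.proto", "bad.proto"), f -> {
            if (f.startsWith("bad")) throw new RuntimeException("boom");
            totalEvents.addAndGet(3);
        }, totalFiles, totalEvents);
        System.out.println(code + " " + totalFiles.get() + " " + totalEvents.get()); // 1 1 3
    }
}
```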

Transport Abstraction

The Emitter interface provides a common abstraction over different transport mechanisms. The Proto2DataHub tool instantiates the appropriate emitter based on the --transport flag:

Transport | Emitter Class | Configuration
rest | RestEmitter | --datahub_api, --datahub_token
file | FileEmitter | --filename
kafka | (not yet implemented) | --

This abstraction ensures that the conversion pipeline is completely decoupled from the delivery mechanism, allowing new transports to be added without modifying the conversion logic.
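A minimal sketch of this abstraction, assuming a simplified single-method interface: `RestLikeEmitter` and `FileLikeEmitter` below are invented stand-ins for the real `RestEmitter` and `FileEmitter` client classes, and the factory mimics selecting a transport from the `--transport` flag.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class TransportSketch {
    interface Emitter {
        void emit(String mcp) throws IOException;
    }

    // Stand-in for RestEmitter: would POST each MCP to the GMS endpoint.
    static class RestLikeEmitter implements Emitter {
        final String endpoint;
        final List<String> sent = new ArrayList<>();
        RestLikeEmitter(String endpoint) { this.endpoint = endpoint; }
        public void emit(String mcp) { sent.add(mcp); }  // network call elided
    }

    // Stand-in for FileEmitter: appends MCPs to a local file for bulk ingestion.
    static class FileLikeEmitter implements Emitter {
        final Path out;
        FileLikeEmitter(Path out) { this.out = out; }
        public void emit(String mcp) throws IOException {
            Files.writeString(out, mcp + System.lineSeparator(),
                    java.nio.file.StandardOpenOption.CREATE,
                    java.nio.file.StandardOpenOption.APPEND);
        }
    }

    public static Emitter forTransport(String transport, String target) {
        return switch (transport) {
            case "rest"  -> new RestLikeEmitter(target);
            case "file"  -> new FileLikeEmitter(Path.of(target));
            case "kafka" -> throw new UnsupportedOperationException("kafka transport not yet implemented");
            default      -> throw new IllegalArgumentException("unknown transport: " + transport);
        };
    }

    public static void main(String[] args) {
        RestLikeEmitter e = (RestLikeEmitter) forTransport("rest", "http://localhost:8080");
        e.emit("{\"entityUrn\":\"urn:li:dataset:...\"}");
        System.out.println(e.sent.size()); // 1
    }
}
```

Because the emission loop depends only on the interface, adding a Kafka transport later means adding one branch to the factory, with no change to the conversion or emission logic.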
