Principle:Datahub project Datahub Protobuf Metadata Emission
| Field | Value |
|---|---|
| Principle Name | Protobuf_Metadata_Emission |
| Category | Metadata Delivery |
| Workflow | Protobuf_Schema_Ingestion |
| Repository | https://github.com/datahub-project/datahub |
| Implemented By | Implementation:Datahub_project_Datahub_Proto2DataHub_RestEmitter_Emit |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
Protobuf Metadata Emission is the principle governing the final stage of the protobuf ingestion pipeline: delivering schema-derived Metadata Change Proposals (MCPs) to DataHub for persistence. After protobuf schemas have been compiled, annotated, and converted into structured metadata aspects, the emission stage is responsible for reliably transmitting each MCP to DataHub's backend through a configured transport mechanism.
This principle establishes that emission must be reliable (every MCP is delivered or failures are reported), transport-agnostic (the same conversion output can be sent via REST, Kafka, or file), and observable (the tool tracks and reports the number of events emitted and files processed).
Usage
The emission principle is applied at the final stage of the Proto2DataHub pipeline. After the ProtobufDataset builder produces a stream of MetadataChangeProposalWrapper collections, the emitter iterates over each MCP and sends it to DataHub.
Supported transport modes include:
- REST (default): Sends MCPs to the DataHub GMS REST API via RestEmitter.
- File: Writes MCPs to a local file via FileEmitter for later bulk ingestion.
- Kafka: Planned but not yet supported; would publish MCPs directly to the metadata change event stream.
Theoretical Basis
Batch Emission Pattern
The emission stage follows a batch streaming pattern where MCPs are produced lazily from the schema conversion stage and emitted eagerly to the DataHub sink. This pattern has several characteristics:
- Lazy production: The ProtobufDataset.getAllMetadataChangeProposals() method returns a Stream<Collection<MetadataChangeProposalWrapper>>. The stream is evaluated lazily, meaning MCPs for a given proto file are only generated when the emitter is ready to consume them.
- Eager emission: Each MCP is emitted immediately upon generation via emitter.emit(mcpw, null).get(). The .get() call blocks until the emission completes, ensuring synchronous delivery and immediate error detection.
- Per-file granularity: The outer loop iterates over input files, and for each file, the inner loop iterates over all MCPs produced by that file's ProtobufDataset. This means a failure on one file does not prevent processing of subsequent files.
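The lazy-production / eager-emission combination can be sketched in plain Java. This is a simplified illustration, not the Proto2DataHub source: the emit method and the string-valued MCPs are hypothetical stand-ins for Emitter.emit(mcpw, null) and MetadataChangeProposalWrapper.

```java
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Stream;

public class EmissionPatternSketch {
    // Hypothetical stand-in for Emitter.emit(mcpw, null): returns a future that
    // completes once the MCP has been delivered to the sink.
    static CompletableFuture<Boolean> emit(String mcp) {
        return CompletableFuture.completedFuture(true);
    }

    // Lazy production: one collection of MCPs per proto file, generated on demand.
    // (MCPs are plain strings here; the real type is MetadataChangeProposalWrapper.)
    static Stream<Collection<String>> getAllMetadataChangeProposals() {
        return Stream.of(
                List.of("schemaMetadata", "status"),
                List.of("globalTags"));
    }

    public static void main(String[] args) {
        // Eager emission: block on each future so failures surface immediately.
        getAllMetadataChangeProposals().forEach(batch -> {
            for (String mcp : batch) {
                try {
                    emit(mcp).get(); // synchronous delivery, immediate error detection
                    System.out.println("emitted " + mcp);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });
    }
}
```

Because the stream is lazy, no MCP exists in memory before the emitter pulls it, and the blocking get() means any transport failure is observed at the exact MCP that caused it.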
Streaming MCPs from Schema Conversion to DataHub Sink
The data flow from conversion to emission follows a two-level streaming model:
Level 1 -- File stream: The Proto2DataHub.main method iterates over input files (either a single file or a directory walk). Each file produces a ProtobufDataset instance.
Level 2 -- MCP stream: Each ProtobufDataset produces two collections of MCPs via getAllMetadataChangeProposals():
- Visitor MCPs: Produced by the TagVisitor and other model-level visitors that create standalone entities (e.g., Tag entities).
- Dataset MCPs: Produced by the DatasetVisitor and the ProtobufDataset itself, including SchemaMetadata, Status, SubTypes, DatasetProperties, Ownership, GlobalTags, Domains, GlossaryTerms, InstitutionalMemory, and Deprecation.
Both levels are flattened via flatMap(Collection::stream) into a single stream of individual MCPs for emission.
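The flattening step is ordinary stream composition. A minimal sketch, with string-valued stand-ins for the visitor and dataset MCP collections:

```java
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlattenSketch {
    public static void main(String[] args) {
        // Two MCP collections per dataset (visitor MCPs, then dataset MCPs),
        // flattened into a single stream of individual MCPs for emission.
        Stream<Collection<String>> perDataset = Stream.of(
                List.of("tag:pii"),                               // visitor MCPs (e.g. Tag entities)
                List.of("schemaMetadata", "status", "subTypes")); // dataset MCPs

        List<String> flat = perDataset
                .flatMap(Collection::stream)
                .collect(Collectors.toList());
        System.out.println(flat);
    }
}
```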
Status Tracking and Error Handling
The emission principle mandates comprehensive status tracking:
- Event counter: An AtomicInteger totalEvents tracks the total number of successfully emitted MCPs across all files.
- File counter: An AtomicInteger totalFiles tracks the number of files processed.
- Exit code: An AtomicInteger exitCode is set to 1 if any emission fails, enabling CI/CD pipelines to detect partial failures.
- Error isolation: Exceptions during the processing of one file are caught and logged, allowing the pipeline to continue with remaining files. The special case of "Cannot autodetect protobuf Message" is treated as a warning rather than an error.
The final status report indicates whether all emissions succeeded or whether partial failures occurred, along with the total counts.
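The counter-plus-isolation pattern can be sketched as follows. The file names and the failure condition are hypothetical; the point is the AtomicInteger bookkeeping and the per-file try/catch that lets processing continue after a failure.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class StatusTrackingSketch {
    public static void main(String[] args) {
        AtomicInteger totalEvents = new AtomicInteger(); // successfully emitted MCPs
        AtomicInteger totalFiles  = new AtomicInteger(); // files processed
        AtomicInteger exitCode    = new AtomicInteger(0); // 1 if any emission failed

        // Hypothetical inputs: the second "file" fails during processing.
        List<String> files = List.of("a.proto", "broken.proto", "c.proto");
        for (String file : files) {
            try {
                if (file.startsWith("broken")) {
                    throw new IllegalStateException("emission failed for " + file);
                }
                totalEvents.incrementAndGet(); // one MCP emitted per file in this sketch
            } catch (Exception e) {
                // Error isolation: log, flag the failure, continue with remaining files.
                System.err.println("WARN: " + e.getMessage());
                exitCode.set(1);
            }
            totalFiles.incrementAndGet();
        }
        System.out.printf("Files: %d, Events: %d, exit=%d%n",
                totalFiles.get(), totalEvents.get(), exitCode.get());
    }
}
```

A nonzero exit code with nonzero event counts is exactly the "partial failure" signal the status report is meant to surface to CI/CD.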
Transport Abstraction
The Emitter interface provides a common abstraction over different transport mechanisms. The Proto2DataHub tool instantiates the appropriate emitter based on the --transport flag:
| Transport | Emitter Class | Configuration |
|---|---|---|
| rest | RestEmitter | --datahub_api, --datahub_token |
| file | FileEmitter | --filename |
| kafka | (not yet implemented) | -- |
This abstraction ensures that the conversion pipeline is completely decoupled from the delivery mechanism, allowing new transports to be added without modifying the conversion logic.
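The decoupling can be illustrated with a minimal sketch. The interface below is a simplified stand-in for DataHub's Emitter abstraction, and the transport lookup is a hypothetical rendering of the --transport flag handling, not the actual Proto2DataHub code:

```java
import java.util.Map;
import java.util.function.Supplier;

public class TransportSketch {
    // Simplified stand-in for the Emitter interface: one method, one MCP.
    interface Emitter {
        void emit(String mcp) throws Exception;
    }

    // Toy implementations; the real RestEmitter/FileEmitter live in datahub-client.
    static Emitter restEmitter() { return mcp -> System.out.println("POST " + mcp); }
    static Emitter fileEmitter() { return mcp -> System.out.println("write " + mcp); }

    public static void main(String[] args) throws Exception {
        // Pick the emitter from the transport name; the conversion pipeline
        // only ever sees the Emitter interface, never the concrete transport.
        Map<String, Supplier<Emitter>> transports = Map.of(
                "rest", TransportSketch::restEmitter,
                "file", TransportSketch::fileEmitter);

        String transport = "rest"; // would come from the --transport flag
        Emitter emitter = transports.getOrDefault(transport,
                () -> { throw new UnsupportedOperationException("kafka not yet implemented"); })
                .get();
        emitter.emit("schemaMetadata for dataset X");
    }
}
```

Adding a new transport means registering one more Emitter implementation; nothing upstream of the emit call changes.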