

Workflow:Datahub project Datahub Protobuf Schema Ingestion

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, Schema_Management, Protobuf
Last Updated: 2026-02-09 17:00 GMT

Overview

End-to-end process for converting Protocol Buffer schema definitions into DataHub dataset metadata, including schema fields, tags, ownership, glossary terms, and domain assignments.

Description

This workflow covers the datahub-protobuf module, which converts compiled protobuf descriptor files into DataHub dataset entities. The process uses a visitor pattern to traverse the protobuf type graph and extract metadata from message definitions, field annotations, comments, and custom extensions. The extracted metadata includes schema fields, ownership, tags, glossary terms, domains, custom properties, and documentation links. Output is emitted as MetadataChangeProposals via REST or file transport.

Usage

Execute this workflow when you have Protocol Buffer schema definitions (used for Kafka topics, gRPC services, or data serialization) and want to register them as DataHub datasets with full schema metadata and governance annotations. This is particularly useful for organizations using protobuf-defined Kafka event schemas that need to be discoverable and governed in DataHub.

Execution Steps

Step 1: Compile Protobuf Schemas

Compile the protobuf source files into a binary descriptor file using the protoc compiler. The compilation must include source info and imports to preserve comments and cross-file references needed for metadata extraction.

Key considerations:

  • Use the --include_imports and --include_source_info flags with protoc
  • The output is a binary descriptor set, conventionally saved with a .protoc or .dsc extension
  • All imported proto files must be resolvable during compilation
  • Optionally include custom meta.proto extensions for rich metadata annotations
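
The compilation step above can be sketched as a small Python helper that assembles the protoc invocation. The flags --proto_path, --include_imports, --include_source_info, and --descriptor_set_out are standard protoc options; the helper names and all file paths are hypothetical placeholders, not part of the workflow itself.

```python
import subprocess


def build_protoc_command(proto_root, proto_files, descriptor_out):
    """Assemble a protoc argument list that preserves comments and imports."""
    if not proto_files:
        raise ValueError("at least one .proto file is required")
    return [
        "protoc",
        f"--proto_path={proto_root}",          # where imports are resolved from
        "--include_imports",                   # embed imported files in the descriptor set
        "--include_source_info",               # keep comments and source locations
        f"--descriptor_set_out={descriptor_out}",
        *[str(p) for p in proto_files],
    ]


def compile_descriptor(proto_root, proto_files, descriptor_out):
    """Run protoc; raises CalledProcessError if compilation fails."""
    cmd = build_protoc_command(proto_root, proto_files, descriptor_out)
    subprocess.run(cmd, check=True)
    return descriptor_out
```

Keeping command construction separate from execution makes it possible to log or review the exact invocation before protoc runs, which helps when an import cannot be resolved.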

Step 2: Annotate Schemas with Metadata Extensions

Optionally extend protobuf schemas with custom annotations using the DataHub meta.proto extension. This enables embedding tags, glossary terms, ownership, domains, and custom properties directly in the protobuf schema definitions.

Key considerations:

  • The meta.proto file defines the DataHubMetadataType enum for annotation types
  • Annotations can be applied at the message level or the field level
  • C-style comments are automatically extracted as descriptions
  • URLs in comments (for example GitHub team or Slack channel links) are parsed into institutional memory links
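
As an illustration of the comment-based link extraction described above, the following sketch scans comment text for GitHub team and Slack channel URLs. The URL shapes used here are the public GitHub and Slack formats; whether the module matches exactly these patterns is an assumption, and the function name `extract_links` is hypothetical. The input is assumed to be comment text as stored in the descriptor's source info, with comment markers already stripped.

```python
import re

# Assumed URL shapes; the patterns the module actually recognizes may differ.
GITHUB_TEAM = re.compile(
    r"https://github\.com/orgs/(?P<org>[\w-]+)/teams/(?P<team>[\w-]+)"
)
SLACK_CHANNEL = re.compile(
    r"https://(?P<workspace>[\w-]+)\.slack\.com/archives/(?P<channel>\w+)"
)


def extract_links(comment):
    """Return link records found in a comment, GitHub teams first."""
    links = []
    for m in GITHUB_TEAM.finditer(comment):
        links.append({
            "kind": "github_team",
            "url": m.group(0),
            "label": f"{m['org']}/{m['team']}",
        })
    for m in SLACK_CHANNEL.finditer(comment):
        links.append({
            "kind": "slack_channel",
            "url": m.group(0),
            "label": m["channel"],
        })
    return links
```

Each record carries the original URL plus a short label, which roughly corresponds to what an institutional memory entry needs: a link and a description.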

Step 3: Configure the Proto2DataHub CLI

Set up the Proto2DataHub command-line tool with the compiled descriptor file path, DataHub connection details, platform name, and environment. The CLI supports REST and File emission transports.

Key considerations:

  • Required: descriptor file path and DataHub API endpoint
  • Platform defaults to kafka (configurable for other platforms)
  • Environment defaults to DEV (configurable: DEV, PROD, etc.)
  • Optional: GitHub org and Slack team ID for link resolution
  • Supports directory walking with glob-based exclusion patterns
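
The settings listed above can be modeled as a small configuration object. This is a sketch only: the class `IngestionSettings` and the function `select_files` are hypothetical and do not reproduce the CLI's actual option names. The platform and environment defaults come from the text; the default transport is an assumption.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from typing import List, Optional


@dataclass
class IngestionSettings:
    """Hypothetical container mirroring the settings described above."""
    descriptor_path: str                 # required
    datahub_api: str                     # required
    platform: str = "kafka"              # default stated in the text
    env: str = "DEV"                     # default stated in the text
    transport: str = "rest"              # assumed default; "rest" or "file"
    github_org: Optional[str] = None     # optional, for link resolution
    slack_team_id: Optional[str] = None  # optional, for link resolution
    exclude: List[str] = field(default_factory=list)

    def validate(self):
        if not self.descriptor_path:
            raise ValueError("descriptor file path is required")
        if not self.datahub_api:
            raise ValueError("DataHub API endpoint is required")
        if self.transport not in ("rest", "file"):
            raise ValueError(f"unsupported transport: {self.transport}")


def select_files(paths, exclude_patterns):
    """Drop any path that matches one of the glob exclusion patterns."""
    return [
        p for p in paths
        if not any(fnmatch(p, pattern) for pattern in exclude_patterns)
    ]
```

Validating the settings before any descriptor is read surfaces a missing path or endpoint early, rather than partway through a directory walk.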

Step 4: Execute Schema Conversion

Run the Proto2DataHub CLI to process the descriptor file. The tool builds a ProtobufGraph from the descriptor, executes visitors to extract metadata, and generates MetadataChangeProposals for each protobuf message type.

Key considerations:

  • Each protobuf message generates a DataHub Dataset entity
  • The visitor pattern extracts: schema fields, tags, terms, ownership, domains, properties
  • Schema fields are sorted by field weight and path for consistent ordering
  • Nested messages, enums, oneof groups, and maps are all handled
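
The visitor approach and the field ordering described above can be illustrated with a simplified model. Everything here is hypothetical: the real module operates on a ProtobufGraph built from the descriptor, and the meaning of "field weight" is not defined in this workflow, so the sketch treats it as a plain integer and sorts by weight first, then by path.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    path: str
    type_name: str
    weight: int                      # assumed to be a simple integer rank
    tags: List[str] = field(default_factory=list)


@dataclass
class Message:
    name: str
    fields: List[Field]


class SchemaFieldVisitor:
    """Collects schema fields in a deterministic order."""

    def visit(self, message):
        ordered = sorted(message.fields, key=lambda f: (f.weight, f.path))
        return [(f.path, f.type_name) for f in ordered]


class TagVisitor:
    """Collects the distinct tags attached to any field."""

    def visit(self, message):
        return sorted({tag for f in message.fields for tag in f.tags})


def convert(message, visitors):
    """Run every visitor over the message and key results by visitor name."""
    return {type(v).__name__: v.visit(message) for v in visitors}
```

Because each visitor extracts one kind of metadata, adding a new aspect (for example ownership) means adding a visitor rather than changing the traversal.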

Step 5: Emit Metadata to DataHub

The generated MetadataChangeProposals (MCPs) are sent to DataHub via the configured transport. REST emission sends HTTP requests to the API of GMS, DataHub's Generalized Metadata Service. File emission writes JSON output for offline review or later import.

Key considerations:

  • Each protobuf message produces multiple MCPs (schema, status, governance aspects)
  • The CLI reports the success count and any emission failures
  • File output can be used for debugging before REST emission
  • An authentication token is required for secured DataHub instances
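
To make the "multiple MCPs per message" and file-output points above concrete, here is a minimal sketch that builds one proposal per aspect and writes them to a JSON file. The dataset URN format and the proposal field names (entityType, entityUrn, changeType, aspectName, aspect) follow DataHub conventions; the function names, the aspect payloads, and the exact file layout are simplifying assumptions rather than the module's real output format.

```python
import json


def dataset_urn(platform, name, env):
    """Build a DataHub dataset URN for the given platform and environment."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def build_proposals(platform, name, env, aspects):
    """Create one proposal per aspect for a single protobuf message."""
    urn = dataset_urn(platform, name, env)
    return [
        {
            "entityType": "dataset",
            "entityUrn": urn,
            "changeType": "UPSERT",
            "aspectName": aspect_name,
            "aspect": value,
        }
        for aspect_name, value in aspects.items()
    ]


def emit_to_file(proposals, path):
    """Write serializable proposals to a JSON file; report count and failures."""
    written, failures = [], []
    for proposal in proposals:
        try:
            json.dumps(proposal)      # check serializability before writing
            written.append(proposal)
        except (TypeError, ValueError) as exc:
            failures.append((proposal.get("aspectName"), str(exc)))
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(written, fh, indent=2)
    return len(written), failures
```

Reviewing a file produced this way before switching to REST emission gives a chance to confirm URNs, platform, and environment without modifying the DataHub instance.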
