
Implementation:Datahub project Datahub Proto2DataHub

From Leeroopedia
Revision as of 14:43, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Datahub_project_Datahub_Proto2DataHub.md)


Knowledge Sources
Domains: Protobuf_Ingestion
Last Updated: 2026-02-10 00:00 GMT

Overview

A CLI application that reads protobuf descriptor and source files and emits DataHub metadata change proposals (MCPs) for each protobuf message, converting protobuf schemas into DataHub dataset entities.

Description

Proto2DataHub is the main entry point for the protobuf-to-DataHub ingestion pipeline. It is a command-line tool built with Apache Commons CLI that accepts a compiled protobuf descriptor file (.dsc or .protoc) alongside individual .proto source files or a directory of source files. For each protobuf source file, it constructs a ProtobufDataset and emits the resulting metadata to DataHub via a configurable transport (REST, Kafka, or file).
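
As a rough illustration of what "dataset entity" means here, the sketch below composes a DataHub dataset URN from a platform, a dataset name, and an environment. The URN layout is the standard DataHub dataset form. Using the fully qualified protobuf message name as the dataset name is an assumption for illustration, and `DatasetUrnSketch` and `com.example.MyEvent` are hypothetical names, not part of Proto2DataHub.

```java
class DatasetUrnSketch {
    // Builds a dataset URN string of the form
    // urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>).
    static String datasetUrn(String platform, String name, String env) {
        return String.format(
            "urn:li:dataset:(urn:li:dataPlatform:%s,%s,%s)", platform, name, env);
    }

    public static void main(String[] args) {
        // "com.example.MyEvent" is a hypothetical fully qualified message name.
        System.out.println(datasetUrn("kafka", "com.example.MyEvent", "PROD"));
    }
}
```

With the values above, the platform corresponds to --platform and the environment to --env.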

Configuration options include:

  • --platform -- The data platform (defaults to kafka)
  • --descriptor -- The generated protobuf descriptor file (required)
  • --file / --directory -- Source file or root directory of proto files
  • --env -- Environment fabric type (DEV, PROD, etc.)
  • --transport -- REST (default), Kafka, or file output
  • --github_org / --slack_id -- Translate comment annotations to URLs
  • --subtype -- Custom subtype for entities (defaults to schema)
  • --exclude -- Glob patterns to exclude files when using --directory

The tool also reads the environment variables DATAHUB_API, DATAHUB_TOKEN, DATAHUB_USER, and DATAHUB_ENV as fallbacks when the corresponding CLI options are not given. A static nested class, AppConfig, validates all configuration parameters.
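
The fallback order described above (CLI option first, then environment variable) can be sketched as follows. This is a minimal sketch, not the AppConfig code: the `resolve` helper and the class name are hypothetical, and the final default-value step is an assumption added for completeness.

```java
import java.util.Map;

class OptionFallbackSketch {
    // Returns the CLI value if set, else the named environment variable,
    // else the supplied default (which may be null).
    static String resolve(String cliValue, String envName,
                          Map<String, String> env, String defaultValue) {
        if (cliValue != null && !cliValue.isBlank()) {
            return cliValue;
        }
        String fromEnv = env.get(envName);
        if (fromEnv != null && !fromEnv.isBlank()) {
            return fromEnv;
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        // No CLI value supplied here, so DATAHUB_ENV is consulted next.
        String fabric = resolve(null, "DATAHUB_ENV", env, null);
        System.out.println("env = " + fabric);
    }
}
```

Passing the environment as a map keeps the helper easy to test without touching the real process environment.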

Usage

Use this tool as a standalone Java application to ingest protobuf schemas into DataHub. It is typically run as part of a CI/CD pipeline or a manual ingestion workflow after protobuf compilation. The tool supports both single-file and batch (directory) modes, making it suitable for processing an entire protobuf repository.
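
Batch mode implies walking a directory tree for .proto files and skipping anything matched by the --exclude globs. One way this could look with the standard java.nio.file API is sketched below. Matching the globs against paths relative to the root directory is an assumption, and `ProtoFileScanSketch` and `findProtoFiles` are hypothetical names rather than the tool's own code.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ProtoFileScanSketch {
    // Collects regular files ending in .proto under root, skipping any whose
    // root-relative path matches one of the exclude glob patterns.
    static List<Path> findProtoFiles(Path root, List<String> excludeGlobs)
            throws IOException {
        List<PathMatcher> excludes = excludeGlobs.stream()
            .map(glob -> FileSystems.getDefault().getPathMatcher("glob:" + glob))
            .collect(Collectors.toList());
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".proto"))
                .filter(p -> {
                    Path relative = root.relativize(p);
                    return excludes.stream().noneMatch(m -> m.matches(relative));
                })
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Scans the directory given as the first argument, or the current one.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        findProtoFiles(root, List.of("build/**", "generated/**"))
            .forEach(System.out::println);
    }
}
```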

Code Reference

Source Location

Signature

public class Proto2DataHub {
    public static void main(String[] args) throws Exception;

    enum TransportOptions { REST, KAFKA, FILE }

    static class AppConfig {
        AppConfig(CommandLine cli);
        private AppConfig validate() throws Exception;
    }
}

Import

import datahub.protobuf.Proto2DataHub;

I/O Contract

Input | Type | Description
--descriptor | File path | Compiled protobuf descriptor file (.dsc / .protoc)
--file | File path | Individual .proto source file
--directory | Directory path | Root directory containing .proto source files
--platform | String | Data platform identifier (e.g., kafka, snowflake)
--datahub_api | URL | DataHub GMS REST API endpoint
--transport | Enum | Transport type: rest, kafka, or file

Output | Type | Description
MCPs | MetadataChangeProposalWrapper | Emitted to DataHub via REST, file, or Kafka transport
Exit code | int | 0 on success, 1 on partial/full failure
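
Two parts of this contract lend themselves to a short sketch: mapping the --transport string onto the TransportOptions enum, and deriving the exit code from the number of failed files. The code below is illustrative only. Case-insensitive parsing and the failure-count parameter are assumptions, and `TransportAndExitSketch`, `parseTransport`, and `exitCode` are hypothetical names. The REST default follows the option list above.

```java
import java.util.Locale;

class TransportAndExitSketch {
    enum TransportOptions { REST, KAFKA, FILE }

    // Maps a --transport value to the enum; a missing value falls back to REST.
    // Unknown values raise IllegalArgumentException via Enum.valueOf.
    static TransportOptions parseTransport(String value) {
        if (value == null || value.isBlank()) {
            return TransportOptions.REST;
        }
        return TransportOptions.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    // Returns 0 when nothing failed and 1 on partial or full failure.
    static int exitCode(int failedFiles) {
        return failedFiles == 0 ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println(parseTransport("file"));
        System.out.println(exitCode(0));
    }
}
```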

Usage Examples

# CLI invocation: single file mode
java -jar datahub-protobuf.jar \
  --descriptor my_repo.dsc \
  --file src/main/proto/MyEvent.proto \
  --platform kafka \
  --env PROD \
  --github_org datahub-project

# CLI invocation: directory mode with excludes
java -jar datahub-protobuf.jar \
  --descriptor my_repo.dsc \
  --directory src/main/proto/ \
  --exclude "build/**,generated/**" \
  --transport file \
  --filename output.json
