
Implementation:Datahub project Datahub Proto2DataHub Main



| Field               | Value                                                          |
|---------------------|----------------------------------------------------------------|
| Implementation Name | Proto2DataHub_Main                                             |
| Type                | API Doc                                                        |
| Workflow            | Protobuf_Schema_Ingestion                                      |
| Repository          | https://github.com/datahub-project/datahub                     |
| Implements          | Principle:Datahub_project_Datahub_Proto2DataHub_Configuration  |
| Last Updated        | 2026-02-09 17:00 GMT                                           |

Overview

Description

Proto2DataHub Main is the entry-point class for the DataHub protobuf ingestion tool. It provides a command-line interface built on Apache Commons CLI that parses user-provided arguments, validates the resulting configuration, and instantiates the appropriate emitter transport. It then orchestrates the end-to-end pipeline: reading protobuf files, constructing ProtobufDataset instances, and emitting Metadata Change Proposals (MCPs) to DataHub.

The class is packaged as the main class of the datahub-protobuf shadow JAR and can be invoked directly via java -jar or through the Gradle run task.

Usage

The tool is invoked from the command line with a set of required and optional flags that configure input sources, DataHub connection parameters, and metadata enrichment options.

Code Reference

Source Location

metadata-integration/java/datahub-protobuf/src/main/java/datahub/protobuf/Proto2DataHub.java, lines 1-441.

Signature

public class Proto2DataHub {
    public static void main(String[] args) throws Exception
}

Import

import datahub.protobuf.Proto2DataHub;
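
Because the class exposes a single static entry point, it can also be driven from other JVM code by passing the same flags the CLI accepts. A minimal sketch (the wrapper class name, paths, and endpoint are placeholders, not part of the tool):

import datahub.protobuf.Proto2DataHub;

public class ProgrammaticIngest {
    public static void main(String[] args) throws Exception {
        // Same flags as the CLI; main throws on validation or emission errors.
        Proto2DataHub.main(new String[] {
            "--descriptor", "schemas/my_schema.protoc",
            "--file", "schemas/my_schema.proto",
            "--datahub_api", "http://localhost:8080",
            "--platform", "kafka",
            "--env", "DEV"
        });
    }
}

Note that the tool reports failure through the process exit code (see I/O Contract below), so on a failed run it may terminate the host JVM; for embedding, running it in a dedicated process is safest.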

I/O Contract

| Direction | Type                           | Description                                                                         |
|-----------|--------------------------------|-------------------------------------------------------------------------------------|
| Input     | String[] args                  | Command-line arguments parsed by Apache Commons CLI.                                |
| Input     | .protoc / .dsc descriptor file | Binary protobuf descriptor set file specified via --descriptor.                     |
| Input     | .proto source file(s)          | Protobuf source files specified via --file or discovered via --directory.           |
| Output    | MCPs emitted to DataHub        | Metadata Change Proposals sent via the configured transport (REST, Kafka, or file). |
| Output    | Exit code                      | 0 on full success; 1 on any emission failure.                                       |
| Output    | Console status report          | Summary line reporting total events emitted and files processed.                    |
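
To make the exit-code contract concrete, here is a hypothetical sketch of the summary-and-exit step (the counter names and report format are illustrative only, not the tool's actual output):

public class ExitContractSketch {
    public static void main(String[] args) {
        // Hypothetical counters; the real tool accumulates these while emitting.
        int totalEvents = 42, totalFiles = 7, failures = 0;
        // Console status report, then exit 0 on full success or 1 on any failure.
        System.out.printf("Emitted %d events for %d files (%d failures)%n",
            totalEvents, totalFiles, failures);
        System.exit(failures > 0 ? 1 : 0);
    }
}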

CLI Options

| Flag            | Type                     | Required    | Default               | Description                                                                            |
|-----------------|--------------------------|-------------|-----------------------|-----------------------------------------------------------------------------------------|
| --descriptor    | String                   | Yes         | --                    | Path to the compiled protobuf descriptor file (.protoc or .dsc).                       |
| --file          | String                   | Conditional | --                    | Path to a single protobuf source file. Required if --directory is not specified.       |
| --directory     | String                   | Conditional | --                    | Root directory containing protobuf source files. Required if --file is not specified.  |
| --exclude       | String (comma-separated) | No          | --                    | Glob patterns to exclude files when using --directory, e.g. "build/**,generated/**".   |
| --message_name  | String                   | No          | (auto-detect)         | Fully qualified protobuf message name to read from the descriptor.                     |
| --datahub_api   | String                   | No          | http://localhost:8080 | DataHub GMS API endpoint URL.                                                           |
| --datahub_token | String                   | No          | (empty)               | Authentication token for DataHub API access.                                           |
| --datahub_user  | String                   | No          | datahub               | DataHub user to attribute the ingestion to.                                            |
| --platform      | String                   | No          | kafka                 | Data platform identifier (e.g., kafka, snowflake).                                     |
| --env           | String                   | No          | DEV                   | FabricType environment (e.g., DEV, PROD, STAGING).                                     |
| --transport     | String                   | No          | rest                  | Transport mechanism: rest, kafka, or file.                                             |
| --filename      | String                   | Conditional | --                    | Output filename; required when --transport is file.                                    |
| --github_org    | String                   | No          | --                    | GitHub organization for resolving team references in comments.                         |
| --slack_id      | String                   | No          | --                    | Slack team ID for resolving channel references in comments. Must start with T.         |
| --subtype       | String                   | No          | schema                | Custom subtype to attach to all entities (e.g., event, topic).                         |
| --protocProp    | Flag                     | No          | false                 | Store the base64-encoded descriptor (protoc) as a custom property on the dataset.      |
| --help          | Flag                     | No          | --                    | Print usage help and exit.                                                             |
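
The flags above map directly onto Apache Commons CLI option definitions. A hedged sketch of how two of them might be declared and parsed (the actual option wiring inside Proto2DataHub may differ in detail):

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;

public class OptionSketch {
    public static void main(String[] args) throws Exception {
        Options options = new Options();
        // Required option taking one argument, matching --descriptor above.
        options.addOption(Option.builder()
            .longOpt("descriptor").hasArg().required()
            .desc("Path to the compiled protobuf descriptor file").build());
        // Optional option; its default is applied after parsing.
        options.addOption(Option.builder()
            .longOpt("platform").hasArg()
            .desc("Data platform identifier, e.g. kafka").build());

        CommandLine cli = new DefaultParser().parse(options, args);
        String platform = cli.getOptionValue("platform", "kafka"); // table default
        System.out.println("descriptor=" + cli.getOptionValue("descriptor")
            + " platform=" + platform);
    }
}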

Usage Examples

Single File Ingestion via REST

java -jar datahub-protobuf.jar \
  --descriptor schemas/my_schema.protoc \
  --file schemas/my_schema.proto \
  --datahub_api http://datahub-gms:8080 \
  --datahub_token eyJhbGciOiJIUzI1... \
  --platform kafka \
  --env PROD

Directory Batch Ingestion

java -jar datahub-protobuf.jar \
  --descriptor all_schemas.dsc \
  --directory ./schemas \
  --exclude "build/**,test/**" \
  --datahub_api http://datahub-gms:8080 \
  --platform kafka \
  --env PROD \
  --github_org datahub-project \
  --slack_id TUMKD5EGJ

File Transport for Offline Ingestion

java -jar datahub-protobuf.jar \
  --descriptor schemas/events.protoc \
  --directory ./schemas/events \
  --transport file \
  --filename output_mcps.json

Legacy Two-Argument Format

The tool supports a legacy invocation format for backward compatibility:

java -jar datahub-protobuf.jar my_schema.protoc my_schema.proto

This is automatically translated to --descriptor my_schema.protoc --file my_schema.proto.
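
A sketch of what this argument rewriting could look like (the helper name and its placement are hypothetical; the real implementation may structure this differently):

public class LegacyArgs {
    // Hypothetical helper illustrating the translation described above.
    static String[] rewriteLegacyArgs(String[] args) {
        if (args.length == 2 && !args[0].startsWith("-")) {
            // Two bare positional arguments: treat them as descriptor + source file.
            return new String[] {"--descriptor", args[0], "--file", args[1]};
        }
        return args;
    }

    public static void main(String[] args) {
        String[] rewritten =
            rewriteLegacyArgs(new String[] {"my_schema.protoc", "my_schema.proto"});
        System.out.println(String.join(" ", rewritten));
        // Prints: --descriptor my_schema.protoc --file my_schema.proto
    }
}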

Internal Processing Flow

// 1. Parse CLI arguments and validate the configuration
// (error handling is elided throughout this condensed flow)
CommandLine cli = parser.parse(options, args);
AppConfig config = new AppConfig(cli).validate();

// 2. Create emitter based on transport (REST shown; file and kafka are analogous)
Emitter emitter = RestEmitter.create(b -> b.server(config.datahubAPI).token(config.datahubToken));

// 3. Create audit stamp
AuditStamp auditStamp = new AuditStamp()
    .setTime(System.currentTimeMillis())
    .setActor(new CorpuserUrn(config.datahubUser));

// 4. Iterate over input files
filePathStream.forEach(filePath -> {
    ProtobufDataset dataset = ProtobufDataset.builder()
        .setDataPlatformUrn(new DataPlatformUrn(config.dataPlatform))
        .setProtocIn(new FileInputStream(config.protoc))
        .setFilename(filePath.toString())
        .setSchema(Files.readString(filePath))
        .setAuditStamp(auditStamp)
        .setFabricType(config.fabricType)
        .setGithubOrganization(config.githubOrg)
        .setSlackTeamId(config.slackId)
        .setSubType(config.subType)
        .build();

    // 5. Emit all MCPs, blocking on each send
    dataset.getAllMetadataChangeProposals()
        .flatMap(Collection::stream)
        .forEach(mcpw -> emitter.emit(mcpw, null).get());
});
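
Step 2 above shows only the REST transport. A sketch of how the file and kafka transports selected by --transport could be constructed instead, assuming the FileEmitter and KafkaEmitter classes from the DataHub Java client (builder details may vary by client version; AppConfig is the parsed configuration from step 1, and a transport field on it is assumed here):

import datahub.client.Emitter;
import datahub.client.file.FileEmitter;
import datahub.client.file.FileEmitterConfig;
import datahub.client.kafka.KafkaEmitter;
import datahub.client.kafka.KafkaEmitterConfig;
import datahub.client.rest.RestEmitter;

// Assumed shape of the transport switch; the real code may differ in detail.
static Emitter createEmitter(AppConfig config) throws Exception {
    switch (config.transport) {
        case "file":
            // Writes MCPs as JSON to the path given by --filename.
            return new FileEmitter(
                FileEmitterConfig.builder().fileName(config.filename).build());
        case "kafka":
            // Publishes MCPs to DataHub's Kafka ingestion topic.
            return new KafkaEmitter(
                KafkaEmitterConfig.builder().bootstrap("localhost:9092").build());
        default:
            return RestEmitter.create(
                b -> b.server(config.datahubAPI).token(config.datahubToken));
    }
}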
