Implementation:Datahub_project_Datahub_Proto2DataHub_Main
| Field | Value |
|---|---|
| Implementation Name | Proto2DataHub_Main |
| Type | API Doc |
| Workflow | Protobuf_Schema_Ingestion |
| Repository | https://github.com/datahub-project/datahub |
| Implements | Principle:Datahub_project_Datahub_Proto2DataHub_Configuration |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
Proto2DataHub Main is the entry-point class for the DataHub protobuf ingestion tool. It provides a command-line interface built on Apache Commons CLI that parses user-provided arguments, validates the configuration, instantiates the appropriate emitter transport, and orchestrates the end-to-end pipeline of reading protobuf files, constructing ProtobufDataset instances, and emitting Metadata Change Proposals to DataHub.
The class is packaged as the main class of the datahub-protobuf shadow JAR and can be invoked directly via java -jar or through the Gradle run task.
Usage
The tool is invoked from the command line with a set of required and optional flags that configure input sources, DataHub connection parameters, and metadata enrichment options.
Code Reference
Source Location
metadata-integration/java/datahub-protobuf/src/main/java/datahub/protobuf/Proto2DataHub.java, lines 1-441.
Signature
```java
public class Proto2DataHub {
  public static void main(String[] args) throws Exception
}
```
Import
```java
import datahub.protobuf.Proto2DataHub;
```
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | `String[] args` | Command-line arguments parsed by Apache Commons CLI. |
| Input | `.protoc` / `.dsc` descriptor file | Binary protobuf descriptor set file specified via `--descriptor`. |
| Input | `.proto` source file(s) | Protobuf source files specified via `--file` or discovered via `--directory`. |
| Output | MCPs emitted to DataHub | Metadata Change Proposals sent via the configured transport (REST, File, or Kafka). |
| Output | Exit code | 0 on full success, 1 on any emission failure. |
| Output | Console status report | Summary line reporting total events emitted and files processed. |
CLI Options
| Flag | Type | Required | Default | Description |
|---|---|---|---|---|
| `--descriptor` | String | Yes | -- | Path to the compiled protobuf descriptor file (`.protoc` or `.dsc`). |
| `--file` | String | Conditional | -- | Path to a single protobuf source file. Required if `--directory` is not specified. |
| `--directory` | String | Conditional | -- | Root directory containing protobuf source files. Required if `--file` is not specified. |
| `--exclude` | String (comma-separated) | No | -- | Glob patterns to exclude files when using `--directory`, e.g. `"build/**,generated/**"`. |
| `--message_name` | String | No | (auto-detect) | Fully qualified protobuf message name to read from the descriptor. |
| `--datahub_api` | String | No | `http://localhost:8080` | DataHub GMS API endpoint URL. |
| `--datahub_token` | String | No | (empty) | Authentication token for DataHub API access. |
| `--datahub_user` | String | No | `datahub` | DataHub user to attribute the ingestion to. |
| `--platform` | String | No | `kafka` | Data platform identifier (e.g., `kafka`, `snowflake`). |
| `--env` | String | No | `DEV` | FabricType environment (e.g., `DEV`, `PROD`, `STAGING`). |
| `--transport` | String | No | `rest` | Transport mechanism: `rest`, `kafka`, or `file`. |
| `--filename` | String | Conditional | -- | Output filename; required when using the `file` transport. |
| `--github_org` | String | No | -- | GitHub organization for resolving team references in comments. |
| `--slack_id` | String | No | -- | Slack team ID for resolving channel references in comments. Must start with `T`. |
| `--subtype` | String | No | `schema` | Custom subtype to attach to all entities (e.g., `event`, `topic`). |
| `-protocProp` | Flag | No | false | Store the base64-encoded protoc descriptor as a custom property on the dataset. |
| `--help` | Flag | No | -- | Print usage help and exit. |
Usage Examples
Single File Ingestion via REST
```shell
java -jar datahub-protobuf.jar \
  --descriptor schemas/my_schema.protoc \
  --file schemas/my_schema.proto \
  --datahub_api http://datahub-gms:8080 \
  --datahub_token eyJhbGciOiJIUzI1... \
  --platform kafka \
  --env PROD
```
Directory Batch Ingestion
```shell
java -jar datahub-protobuf.jar \
  --descriptor all_schemas.dsc \
  --directory ./schemas \
  --exclude "build/**,test/**" \
  --datahub_api http://datahub-gms:8080 \
  --platform kafka \
  --env PROD \
  --github_org datahub-project \
  --slack_id TUMKD5EGJ
```
File Transport for Offline Ingestion
```shell
java -jar datahub-protobuf.jar \
  --descriptor schemas/events.protoc \
  --directory ./schemas/events \
  --transport file \
  --filename output_mcps.json
```
Legacy Two-Argument Format
The tool supports a legacy invocation format for backward compatibility:
```shell
java -jar datahub-protobuf.jar my_schema.protoc my_schema.proto
```
This is automatically translated to `--descriptor my_schema.protoc --file my_schema.proto`.
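The translation described above can be sketched as a small pure function (illustrative only; the class and method names here are not the tool's actual code):

```java
public class LegacyArgs {
    /** Translate the legacy two-positional-argument form into the flag form. Illustrative sketch. */
    static String[] translate(String[] args) {
        // Legacy form: exactly two positional arguments, neither starting with "-"
        if (args.length == 2 && !args[0].startsWith("-") && !args[1].startsWith("-")) {
            return new String[] {"--descriptor", args[0], "--file", args[1]};
        }
        return args; // already in flag form; pass through unchanged
    }

    public static void main(String[] args) {
        String[] out = translate(new String[] {"my_schema.protoc", "my_schema.proto"});
        System.out.println(String.join(" ", out));
        // --descriptor my_schema.protoc --file my_schema.proto
    }
}
```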
Internal Processing Flow
The main flow, simplified (exception handling and the kafka/file transport branches are elided):
```java
// 1. Parse CLI arguments and validate the resulting configuration
CommandLine cli = parser.parse(options, args);
AppConfig config = new AppConfig(cli).validate();

// 2. Create an emitter based on the configured transport (REST shown)
Emitter emitter = RestEmitter.create(b -> b.server(config.datahubAPI).token(config.datahubToken));

// 3. Create an audit stamp attributing the ingestion to the configured user
AuditStamp auditStamp = new AuditStamp()
    .setTime(System.currentTimeMillis())
    .setActor(new CorpuserUrn(config.datahubUser));

// 4. Iterate over the input files, building one ProtobufDataset per file
filePathStream.forEach(filePath -> {
    ProtobufDataset dataset = ProtobufDataset.builder()
        .setDataPlatformUrn(new DataPlatformUrn(config.dataPlatform))
        .setProtocIn(new FileInputStream(config.protoc))
        .setFilename(filePath.toString())
        .setSchema(Files.readString(filePath))
        .setAuditStamp(auditStamp)
        .setFabricType(config.fabricType)
        .setGithubOrganization(config.githubOrg)
        .setSlackTeamId(config.slackId)
        .setSubType(config.subType)
        .build();

    // 5. Emit every Metadata Change Proposal produced for this dataset
    dataset.getAllMetadataChangeProposals()
        .flatMap(Collection::stream)
        .forEach(mcpw -> emitter.emit(mcpw, null).get());
});
```
Related Pages
- Principle:Datahub_project_Datahub_Proto2DataHub_Configuration
- Implementation:Datahub_project_Datahub_ProtobufDataset_Builder
- Implementation:Datahub_project_Datahub_Proto2DataHub_RestEmitter_Emit
- Implementation:Datahub_project_Datahub_Protoc_Descriptor_Set_Out
- Environment:Datahub_project_Datahub_Java_Build