Implementation:Datahub project Datahub Proto2DataHub
| Knowledge Sources | |
|---|---|
| Domains | Protobuf_Ingestion |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A CLI application that reads protobuf descriptor and source files and emits DataHub metadata change proposals (MCPs) for each protobuf message, converting protobuf schemas into DataHub dataset entities.
Description
Proto2DataHub is the main entry point for the protobuf-to-DataHub ingestion pipeline. It is a command-line tool built with Apache Commons CLI that accepts a compiled protobuf descriptor file (.dsc or .protoc) alongside individual .proto source files or a directory of source files. For each protobuf source file, it constructs a ProtobufDataset and emits the resulting metadata to DataHub via a configurable transport (REST, Kafka, or file).
Configuration options include:
- --platform -- The data platform (defaults to kafka)
- --descriptor -- The generated protobuf descriptor file (required)
- --file / --directory -- Source file or root directory of proto files
- --env -- Environment fabric type (DEV, PROD, etc.)
- --transport -- REST (default), Kafka, or file output
- --github_org / --slack_id -- Translate comment annotations to URLs
- --subtype -- Custom subtype for entities (defaults to schema)
- --exclude -- Glob patterns to exclude files when using --directory
The tool also supports environment variables (DATAHUB_API, DATAHUB_TOKEN, DATAHUB_USER, DATAHUB_ENV) as fallbacks for CLI options. An AppConfig inner class handles validation of all configuration parameters.
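The fallback order described above (CLI option first, then environment variable, then a built-in default) can be sketched as follows. This is a minimal illustration, not the code in AppConfig: the `OptionResolver` class, its `resolve` method, and the `http://localhost:8080` default are assumptions introduced for the example.

```java
import java.util.Map;

// Hypothetical helper, not part of Proto2DataHub: illustrates the
// "CLI option, then environment variable, then default" precedence.
final class OptionResolver {
    private OptionResolver() {}

    /**
     * Returns the CLI value if it is set, otherwise the value of envKey
     * in the supplied environment, otherwise defaultValue.
     */
    static String resolve(String cliValue, Map<String, String> env,
                          String envKey, String defaultValue) {
        if (cliValue != null && !cliValue.isEmpty()) {
            return cliValue;
        }
        String fromEnv = env.get(envKey);
        if (fromEnv != null && !fromEnv.isEmpty()) {
            return fromEnv;
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        // No --datahub_api value was passed, so DATAHUB_API (if set) is used
        // before the assumed default.
        String api = resolve(null, System.getenv(), "DATAHUB_API", "http://localhost:8080");
        System.out.println("DataHub API: " + api);
    }
}
```

Passing the environment as a `Map` rather than calling `System.getenv()` inside the helper keeps the precedence logic easy to unit test.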
Usage
Use this tool as a standalone Java application to ingest protobuf schemas into DataHub. It is typically run as part of a CI/CD pipeline or a manual ingestion workflow after protobuf compilation. The tool supports both single-file and batch (directory) modes, making it suitable for processing an entire protobuf repository.
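Directory mode combined with --exclude amounts to walking a source tree and dropping any file that matches a glob. The sketch below shows one way to do that with the standard `java.nio.file` API. The `ProtoFileScanner` class is hypothetical, and matching globs against paths relative to the root is an assumption; the actual tool may match against a different path form.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Proto2DataHub.
final class ProtoFileScanner {
    private ProtoFileScanner() {}

    /** Returns true if relativePath matches any of the glob patterns. */
    static boolean isExcluded(Path relativePath, List<String> excludeGlobs) {
        for (String glob : excludeGlobs) {
            PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
            if (matcher.matches(relativePath)) {
                return true;
            }
        }
        return false;
    }

    /** Collects all .proto files under root that no exclude glob matches. */
    static List<Path> findProtoFiles(Path root, List<String> excludeGlobs) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".proto"))
                // Assumption: globs are evaluated against the path relative to root.
                .filter(p -> !isExcluded(root.relativize(p), excludeGlobs))
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

With the patterns `build/**` and `generated/**`, a file at `build/tmp/Foo.proto` under the root would be skipped, while `events/MyEvent.proto` would be kept.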
Code Reference
Source Location
- Repository: Datahub_project_Datahub
- File: metadata-integration/java/datahub-protobuf/src/main/java/datahub/protobuf/Proto2DataHub.java
Signature
public class Proto2DataHub {
    public static void main(String[] args) throws Exception;
    enum TransportOptions { REST, KAFKA, FILE }
    static class AppConfig {
        AppConfig(CommandLine cli);
        private AppConfig validate() throws Exception;
    }
}
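As an illustration of how a --transport value such as rest, kafka, or file could map onto the TransportOptions enum, here is a minimal sketch. The `TransportParsing` wrapper and its `parse` method are hypothetical, and the enum is redeclared locally so the example compiles on its own. The case-insensitive handling and the error message are assumptions, not the behavior of AppConfig.

```java
import java.util.Locale;

// Hypothetical wrapper, not part of Proto2DataHub.
final class TransportParsing {
    private TransportParsing() {}

    // Redeclared here for self-containment; mirrors the enum in the signature.
    enum TransportOptions { REST, KAFKA, FILE }

    /** Parses a transport name, defaulting to REST when no value is given. */
    static TransportOptions parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return TransportOptions.REST;
        }
        try {
            // Locale.ROOT avoids locale-specific case mapping surprises.
            return TransportOptions.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                "Unknown transport '" + raw + "'; expected rest, kafka, or file", e);
        }
    }
}
```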
Import
import datahub.protobuf.Proto2DataHub;
I/O Contract
| Input | Type | Description |
|---|---|---|
| --descriptor | File path | Compiled protobuf descriptor file (.dsc / .protoc) |
| --file | File path | Individual .proto source file |
| --directory | Directory path | Root directory containing .proto source files |
| --platform | String | Data platform identifier (e.g., kafka, snowflake) |
| --datahub_api | URL | DataHub GMS REST API endpoint |
| --transport | Enum | Transport type: rest, kafka, or file |

| Output | Type | Description |
|---|---|---|
| MCPs | MetadataChangeProposalWrapper | Emitted to DataHub via REST, file, or Kafka transport |
| Exit code | int | 0 on success, 1 on partial/full failure |
Usage Examples
// CLI invocation: single file mode
// java -jar datahub-protobuf.jar \
// --descriptor my_repo.dsc \
// --file src/main/proto/MyEvent.proto \
// --platform kafka \
// --env PROD \
// --github_org datahub-project
// CLI invocation: directory mode with excludes
// java -jar datahub-protobuf.jar \
// --descriptor my_repo.dsc \
// --directory src/main/proto/ \
// --exclude "build/**,generated/**" \
// --transport file \
// --filename output.json
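For CI/CD pipelines that drive the tool from Java rather than a shell script, the directory-mode invocation above could be assembled and launched with `ProcessBuilder`. This is a sketch under assumptions: the `Proto2DataHubRunner` class and its methods are hypothetical, and the jar name and paths are simply the ones used in the examples above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical CI helper, not part of the DataHub codebase.
final class Proto2DataHubRunner {
    private Proto2DataHubRunner() {}

    /** Builds the argument list for a directory-mode run. */
    static List<String> buildCommand(String jar, String descriptor,
                                     String protoDir, List<String> excludes) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jar);
        cmd.add("--descriptor");
        cmd.add(descriptor);
        cmd.add("--directory");
        cmd.add(protoDir);
        if (!excludes.isEmpty()) {
            // Patterns are comma-separated, as in the example invocation.
            cmd.add("--exclude");
            cmd.add(String.join(",", excludes));
        }
        return cmd;
    }

    /** Runs the command and returns the process exit code (0 means success). */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("datahub-protobuf.jar", "my_repo.dsc",
            "src/main/proto/", List.of("build/**", "generated/**"));
        // Only prints the command; call run(cmd) to execute it.
        System.out.println(String.join(" ", cmd));
    }
}
```

A pipeline step would call `run(cmd)` and fail the build on a non-zero result, since the tool exits with 1 on partial or full failure.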