Principle:Datahub project Datahub Proto2DataHub Configuration
| Field | Value |
|---|---|
| Principle Name | Proto2DataHub_Configuration |
| Category | Tool Configuration |
| Workflow | Protobuf_Schema_Ingestion |
| Repository | https://github.com/datahub-project/datahub |
| Implemented By | Implementation:Datahub_project_Datahub_Proto2DataHub_Main |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
Proto2DataHub Configuration is the principle governing how the schema-to-metadata conversion tool is configured through command-line arguments and environment variables. The Proto2DataHub tool uses Apache Commons CLI to define a structured set of options that control every aspect of the protobuf ingestion pipeline -- from input file selection and platform designation to transport mechanism and output formatting.
This principle establishes that configuration should be declarative, validated, and defaulted sensibly. Every required parameter is enforced at parse time, optional parameters carry reasonable defaults (e.g., platform defaults to kafka, environment defaults to DEV), and invalid configurations are rejected with clear error messages before any processing begins.
Usage
The configuration principle is applied whenever the Proto2DataHub tool is invoked, whether from a CI/CD pipeline, a shell script, or direct command-line execution. Configuration parameters fall into several categories:
- Input specification: What protobuf files to process (--descriptor, --file, --directory, --exclude).
- DataHub connection: Where and how to send metadata (--datahub_api, --datahub_token, --transport).
- Metadata enrichment: Additional context for generated metadata (--platform, --env, --github_org, --slack_id, --subtype).
- Output control: How results are delivered (--transport, --filename).
Theoretical Basis
CLI Argument Parsing Pattern
The Apache Commons CLI library provides a declarative approach to command-line argument definition. Each option is defined as an Option object with:
- Long name: The --flag identifier used on the command line.
- Required flag: Whether the option must be provided.
- Argument presence: Whether the option takes a value.
- Description: Help text for usage display.
This pattern separates three concerns: option definition, option parsing, and option consumption. Options are defined as static constants, parsed by the DefaultParser, and consumed by the AppConfig constructor. This three-phase approach ensures that:
- Adding a new option requires changes in exactly one place (the option constant definition).
- Parsing logic is handled by the library, not by custom code.
- Validation is centralized in the AppConfig.validate() method.
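The three phases can be illustrated with a small, dependency-free sketch. It is a stand-in for the pattern, not the tool's code and not the Commons CLI API. The class names (CliPatternSketch, OptionSpec, Config) and the hand-written parse method are hypothetical and exist only so the example runs without external libraries. In the real tool, Option objects and DefaultParser play the roles of OptionSpec and parse.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Dependency-free stand-in for the define / parse / consume pattern.
// All names here are hypothetical and are not taken from Proto2DataHub.
final class CliPatternSketch {

  // Phase 1: declarative option definition.
  static final class OptionSpec {
    final String longName;
    final boolean required;
    final boolean hasArg;
    final String description;

    OptionSpec(String longName, boolean required, boolean hasArg, String description) {
      this.longName = longName;
      this.required = required;
      this.hasArg = hasArg;
      this.description = description;
    }
  }

  static final OptionSpec DESCRIPTOR =
      new OptionSpec("descriptor", true, true, "Compiled protobuf descriptor set");
  static final OptionSpec PLATFORM =
      new OptionSpec("platform", false, true, "Data platform name");
  static final List<OptionSpec> OPTIONS = List.of(DESCRIPTOR, PLATFORM);

  // Phase 2: generic parsing driven only by the definitions above.
  static Map<String, String> parse(List<OptionSpec> specs, String[] args) {
    Map<String, String> values = new HashMap<>();
    for (int i = 0; i < args.length; i++) {
      String arg = args[i];
      OptionSpec spec = specs.stream()
          .filter(s -> ("--" + s.longName).equals(arg))
          .findFirst()
          .orElseThrow(() -> new IllegalArgumentException("Unrecognized option: " + arg));
      if (spec.hasArg) {
        if (i + 1 >= args.length) {
          throw new IllegalArgumentException("Missing argument for option: " + arg);
        }
        values.put(spec.longName, args[++i]); // consume the option's value
      } else {
        values.put(spec.longName, "true");
      }
    }
    // Required options are enforced at parse time, before any consumer runs.
    for (OptionSpec spec : specs) {
      if (spec.required && !values.containsKey(spec.longName)) {
        throw new IllegalArgumentException("Missing required option: --" + spec.longName);
      }
    }
    return values;
  }

  // Phase 3: consumption into a typed configuration object with defaults.
  static final class Config {
    final String descriptor;
    final String platform;

    Config(Map<String, String> parsed) {
      this.descriptor = parsed.get(DESCRIPTOR.longName);
      this.platform = parsed.getOrDefault(PLATFORM.longName, "kafka");
    }
  }

  public static void main(String[] args) {
    Config config = new Config(parse(OPTIONS, new String[] {"--descriptor", "schema.dsc"}));
    System.out.println(config.descriptor + " on platform " + config.platform);
  }
}
```

In this sketch the parser never mentions a specific option by name. That is the property the pattern relies on: the option definitions alone drive what is accepted and what is required.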
Batch Schema Processing Configuration
The configuration model supports two primary modes of operation:
Single-file mode (--file): Processes a single protobuf source file against a compiled descriptor set. This mode is suitable for targeted ingestion of individual schemas.
Directory mode (--directory with optional --exclude): Walks a directory tree to discover all .proto files, optionally excluding paths matching glob patterns. This mode enables batch processing of entire schema repositories.
Both modes require a pre-compiled descriptor set (--descriptor) that contains the binary protobuf descriptors. The descriptor set may be a single .dsc file covering the entire repository or individual .protoc files corresponding to each source file.
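Directory mode can be approximated with the standard java.nio.file API, as in the sketch below. The class and method names are hypothetical. Matching the exclusion globs against paths relative to the root directory is an assumption, since the tool may match them differently.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper illustrating directory mode; not the tool's actual class.
final class ProtoDiscoverySketch {

  // Returns every .proto file under root whose root-relative path
  // matches none of the given glob patterns, in sorted order.
  static List<Path> findProtoFiles(Path root, List<String> excludeGlobs) throws IOException {
    List<PathMatcher> excludes = excludeGlobs.stream()
        .map(glob -> FileSystems.getDefault().getPathMatcher("glob:" + glob))
        .collect(Collectors.toList());
    try (Stream<Path> paths = Files.walk(root)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString().endsWith(".proto"))
          .filter(p -> excludes.stream().noneMatch(m -> m.matches(root.relativize(p))))
          .sorted()
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length == 0) {
      System.out.println("usage: ProtoDiscoverySketch <directory> [excludeGlob...]");
      return;
    }
    List<String> globs = List.of(args).subList(1, args.length);
    findProtoFiles(Path.of(args[0]), globs).forEach(System.out::println);
  }
}
```

With a pattern such as vendor/**, every file below the vendor directory is skipped, because ** in a Java glob matches across directory boundaries.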
Environment Variable Fallback
The configuration supports an environment variable fallback pattern where CLI arguments take precedence over environment variables, which in turn take precedence over hardcoded defaults:
| Parameter | CLI Flag | Environment Variable | Default |
|---|---|---|---|
| DataHub API | --datahub_api | DATAHUB_API | http://localhost:8080 |
| Auth Token | --datahub_token | DATAHUB_TOKEN | (empty) |
| User | --datahub_user | DATAHUB_USER | datahub |
| Environment | --env | DATAHUB_ENV | DEV |
| GitHub Org | --github_org | DATAHUB_GITHUBORG | (none) |
| Slack Team ID | --slack_id | DATAHUB_SLACKID | (none) |
This layered approach enables the tool to be configured differently across environments (local development, CI, production) without changing the invocation command.
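The precedence order described above (CLI value, then environment variable, then hardcoded default) can be sketched as a single lookup function. The class and method names are hypothetical, and the environment is passed in as a Map so the logic can be exercised without touching the real process environment. The tool's own resolution code may be structured differently.

```java
import java.util.Map;

// Hypothetical illustration of the CLI > environment > default lookup order.
final class LayeredConfigSketch {

  // cliValue is null when the flag was not given on the command line.
  // fallback may be null for parameters that have no default.
  static String resolve(String cliValue, Map<String, String> env, String envKey, String fallback) {
    if (cliValue != null) {
      return cliValue; // an explicit CLI argument always wins
    }
    return env.getOrDefault(envKey, fallback); // environment variable, else default
  }

  public static void main(String[] args) {
    Map<String, String> env = System.getenv();
    // args[0] stands in for a parsed --datahub_api value, if one was supplied.
    String cliApi = args.length > 0 ? args[0] : null;
    System.out.println("api = " + resolve(cliApi, env, "DATAHUB_API", "http://localhost:8080"));
    System.out.println("env = " + resolve(null, env, "DATAHUB_ENV", "DEV"));
  }
}
```

Under this scheme, a CI job can export DATAHUB_API once and reuse the same invocation everywhere, while a developer can still override it for a single run by passing --datahub_api.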
Validation Before Processing
The AppConfig.validate() method enforces several constraints before any processing begins:
- If transport is FILE, a filename must be provided.
- The descriptor file must exist and be a regular file.
- At least one of --file or --directory must be specified.
- If a Slack team ID is provided, it must start with the letter T (per Slack conventions).
This fail-fast approach prevents the tool from starting expensive processing operations only to fail partway through due to misconfiguration.
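As a rough sketch, the four constraints listed above could be checked as follows. This is not the body of AppConfig.validate(). The class name, parameter list, exception type, and error messages are assumptions made for illustration, and REST is used only as a placeholder for a transport other than FILE.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical fail-fast validator mirroring the constraints listed above.
final class ConfigValidationSketch {

  // Throws IllegalArgumentException on the first violated constraint.
  static void validate(String transport, String filename, String descriptor,
                       String file, String directory, String slackId) {
    if ("FILE".equals(transport) && (filename == null || filename.isEmpty())) {
      throw new IllegalArgumentException("Transport FILE requires --filename");
    }
    if (descriptor == null || !Files.isRegularFile(Path.of(descriptor))) {
      throw new IllegalArgumentException(
          "Descriptor must be an existing regular file: " + descriptor);
    }
    if (file == null && directory == null) {
      throw new IllegalArgumentException("Specify --file or --directory");
    }
    if (slackId != null && !slackId.startsWith("T")) {
      throw new IllegalArgumentException("Slack team ID must start with 'T': " + slackId);
    }
  }

  public static void main(String[] args) throws Exception {
    // A temporary file stands in for a real descriptor set.
    Path descriptor = Files.createTempFile("schema", ".dsc");
    validate("REST", null, descriptor.toString(), "example.proto", null, "T12345");
    System.out.println("configuration ok");
    Files.delete(descriptor);
  }
}
```

All checks run before any descriptor is read or any metadata is emitted, so a misconfigured run stops immediately with one specific message.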