Principle:Datahub project Datahub Proto2DataHub Configuration

From Leeroopedia


Field           Value
Principle Name  Proto2DataHub_Configuration
Category        Tool Configuration
Workflow        Protobuf_Schema_Ingestion
Repository      https://github.com/datahub-project/datahub
Implemented By  Implementation:Datahub_project_Datahub_Proto2DataHub_Main
Last Updated    2026-02-09 17:00 GMT

Overview

Description

Proto2DataHub Configuration is the principle governing how the schema-to-metadata conversion tool is configured through command-line arguments and environment variables. The Proto2DataHub tool uses Apache Commons CLI to define a structured set of options that control each stage of the protobuf ingestion pipeline, from input file selection and platform designation to transport mechanism and output formatting.

This principle establishes that configuration should be declarative, validated, and defaulted sensibly. Every required parameter is enforced at parse time, optional parameters carry reasonable defaults (e.g., platform defaults to kafka, environment defaults to DEV), and invalid configurations are rejected with clear error messages before any processing begins.

Usage

The configuration principle is applied whenever the Proto2DataHub tool is invoked, whether from a CI/CD pipeline, a shell script, or direct command-line execution. Configuration parameters fall into several categories:

  • Input specification: What protobuf files to process (--descriptor, --file, --directory, --exclude).
  • DataHub connection: Where and how to send metadata (--datahub_api, --datahub_token, --transport).
  • Metadata enrichment: Additional context for generated metadata (--platform, --env, --github_org, --slack_id, --subtype).
  • Output control: How results are delivered (--transport, --filename).

Theoretical Basis

CLI Argument Parsing Pattern

The Apache Commons CLI library provides a declarative approach to command-line argument definition. Each option is defined as an Option object with:

  • Long name: The --flag identifier used on the command line.
  • Required flag: Whether the option must be provided.
  • Argument presence: Whether the option takes a value.
  • Description: Help text for usage display.

This pattern keeps three concerns separate: option definition, option parsing, and option consumption. Options are defined as static constants, parsed by the DefaultParser, and consumed by the AppConfig constructor. This three-phase approach ensures that:

  1. An option's name, required flag, argument presence, and help text are declared in exactly one place (the option constant definition).
  2. Parsing logic is handled by the library, not by custom code.
  3. Validation is centralized in the AppConfig.validate() method.

Batch Schema Processing Configuration

The configuration model supports two primary modes of operation:

Single-file mode (--file): Processes a single protobuf source file against a compiled descriptor set. This mode is suitable for targeted ingestion of individual schemas.

Directory mode (--directory with optional --exclude): Walks a directory tree to discover all .proto files, optionally excluding paths matching glob patterns. This mode enables batch processing of entire schema repositories.

Both modes require a pre-compiled descriptor set (--descriptor) that contains the binary protobuf descriptors. The descriptor set may be a single .dsc file covering the entire repository or individual .protoc files corresponding to each source file.
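The directory mode described above can be sketched with the standard java.nio.file API. This is a minimal illustration, not the tool's actual code: the class name ProtoFileFinder and its find method are hypothetical, and matching each exclude glob against the path relative to the root directory is an assumption about how --exclude patterns are interpreted.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of Proto2DataHub.
final class ProtoFileFinder {
  private ProtoFileFinder() {}

  /**
   * Walks the tree under root and returns every regular file whose name ends
   * in .proto, skipping files whose root-relative path matches an exclude glob.
   */
  static List<Path> find(Path root, List<String> excludeGlobs) throws IOException {
    // Compile each glob once. Assumption: globs are relative to root.
    List<PathMatcher> excludes = excludeGlobs.stream()
        .map(glob -> FileSystems.getDefault().getPathMatcher("glob:" + glob))
        .collect(Collectors.toList());

    // Files.walk holds directory handles open, so close the stream when done.
    try (Stream<Path> paths = Files.walk(root)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString().endsWith(".proto"))
          .filter(p -> excludes.stream().noneMatch(m -> m.matches(root.relativize(p))))
          .sorted()
          .collect(Collectors.toList());
    }
  }
}
```

Under these assumptions, a call such as ProtoFileFinder.find(root, List.of("build/**")) would return the .proto files under root while skipping anything below its build directory.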

Environment Variable Fallback

The configuration supports an environment variable fallback pattern where CLI arguments take precedence over environment variables, which in turn take precedence over hardcoded defaults:

Parameter      CLI Flag         Environment Variable  Default
DataHub API    --datahub_api    DATAHUB_API           http://localhost:8080
Auth Token     --datahub_token  DATAHUB_TOKEN         (empty)
User           --datahub_user   DATAHUB_USER          datahub
Environment    --env            DATAHUB_ENV           DEV
GitHub Org     --github_org     DATAHUB_GITHUBORG     (none)
Slack Team ID  --slack_id       DATAHUB_SLACKID       (none)

This layered approach enables the tool to be configured differently across environments (local development, CI, production) without changing the invocation command.
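The precedence order (CLI argument, then environment variable, then hardcoded default) can be expressed as a small lookup function. The sketch below is illustrative only: LayeredSetting and resolve are hypothetical names, the environment is passed in as a map so the logic can be exercised without touching the real process environment, and treating only a missing value (not an empty string) as "unset" is an assumption.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of Proto2DataHub.
final class LayeredSetting {
  private LayeredSetting() {}

  /**
   * Resolves one setting using the layered order:
   * CLI value first, then the named environment variable, then the default.
   */
  static String resolve(String cliValue, String envName,
                        Map<String, String> env, String defaultValue) {
    if (cliValue != null) {
      return cliValue;            // explicit flag wins
    }
    String envValue = env.get(envName);
    if (envValue != null) {
      return envValue;            // fall back to the environment
    }
    return defaultValue;          // may itself be null for "(none)"
  }
}
```

In real use the map would typically be System.getenv(). For example, with no --env flag given and DATAHUB_ENV unset, resolve(null, "DATAHUB_ENV", System.getenv(), "DEV") yields DEV.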

Validation Before Processing

The AppConfig.validate() method enforces several constraints before any processing begins:

  • If transport is FILE, a filename must be provided.
  • The descriptor file must exist and be a regular file.
  • At least one of --file or --directory must be specified.
  • If a Slack team ID is provided, it must start with the letter T (per Slack conventions).

This fail-fast approach prevents the tool from starting expensive processing operations only to fail partway through due to misconfiguration.
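The four checks listed above can be sketched as a single fail-fast method. This is not the body of AppConfig.validate() itself: the class ConfigCheck, its parameter list, the exception type, and the error messages are assumptions made for illustration, and the OTHER enum constant merely stands in for any transport that is not FILE.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical stand-in for illustration; not the tool's AppConfig class.
final class ConfigCheck {
  private ConfigCheck() {}

  // Only FILE is named in the text; OTHER represents every non-file transport.
  enum Transport { FILE, OTHER }

  /** Throws IllegalArgumentException on the first violated constraint. */
  static void validate(Transport transport, String filename, String descriptor,
                       String inputFile, String inputDir, String slackId) {
    if (transport == Transport.FILE && filename == null) {
      throw new IllegalArgumentException("transport FILE requires a filename");
    }
    // isRegularFile is false for missing paths, so it covers both conditions.
    if (descriptor == null || !Files.isRegularFile(Paths.get(descriptor))) {
      throw new IllegalArgumentException(
          "descriptor must be an existing regular file: " + descriptor);
    }
    if (inputFile == null && inputDir == null) {
      throw new IllegalArgumentException("provide an input file or an input directory");
    }
    if (slackId != null && !slackId.startsWith("T")) {
      throw new IllegalArgumentException("Slack team ID must start with T: " + slackId);
    }
  }
}
```

Because every check runs before any descriptor is read or any metadata is emitted, a misconfigured invocation stops immediately with a message naming the offending setting.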
