Principle:Datahub project Datahub Protobuf Annotation

From Leeroopedia


Field            Value
Principle Name   Protobuf_Annotation
Category         Governance Metadata
Workflow         Protobuf_Schema_Ingestion
Repository       https://github.com/datahub-project/datahub
Implemented By   Implementation:Datahub_project_Datahub_Meta_Proto_Custom_Options
Last Updated     2026-02-09 17:00 GMT

Overview

Description

Protobuf Annotation is the principle of embedding governance metadata directly within Protocol Buffer schema definitions using custom protobuf extensions. Rather than maintaining metadata in external systems or configuration files that can drift out of sync with the schemas they describe, this principle advocates for a schema-as-documentation pattern where ownership, tags, domains, deprecation status, and other governance metadata live alongside the field and message definitions they annotate.

This approach leverages the protobuf custom options mechanism -- a first-class language feature that allows users to define typed extension fields on standard protobuf descriptor types (file options, message options, field options). By importing a shared meta.proto definition, schema authors can attach structured governance metadata that is preserved through compilation and subsequently extracted by the DataHub ingestion pipeline.

Usage

Schema authors apply annotations directly in their .proto files by importing meta.proto and using the defined custom options. These annotations are then automatically extracted during the DataHub protobuf ingestion process and mapped to corresponding DataHub aspects.

Common usage patterns include:

  • Ownership declaration: Specifying team or individual ownership of a schema at the message or file level.
  • Tag attachment: Labeling schemas or individual fields with governance tags (e.g., PII, Confidential).
  • Domain classification: Associating schemas with business domains for organizational clarity.
  • Deprecation marking: Indicating that a schema or field is deprecated with an explanatory message.
  • Primary key identification: Marking fields that serve as primary keys for downstream consumers.
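
The patterns above might look as follows in a schema file. This is a minimal sketch, not DataHub's actual meta.proto: the import path, the option names (borrowed from the annotation names used on this page), the value types, and the Customer message are all assumptions made for illustration. The .proto text is held in a Python string and scanned with a regular expression only to list the annotations it contains; real extraction works on the compiled descriptor set, not on source text.

```python
import re

# Hypothetical annotated schema. Option names follow the annotation names
# used on this page; the real meta.proto may define them differently.
ANNOTATED_PROTO = '''
syntax = "proto3";
import "meta.proto";  // assumed import path

message Customer {
  option (meta.ownership) = "team-customer-data";
  option (meta.domain) = "Sales";
  option (meta.tag) = "Confidential";
  option (meta.deprecation) = "Use CustomerV2 instead";

  string customer_id = 1 [(meta.is_primary_key) = true];
  string email = 2 [(meta.tag) = "PII"];
}
'''

# Matches "(meta.<name>) = <value>", where the value is either a quoted
# string or a bare token such as true.
OPTION_RE = re.compile(r'\(meta\.(\w+)\)\s*=\s*("[^"]*"|\w+)')


def list_annotations(proto_text):
    """Return (option_name, raw_value) pairs in source order."""
    return [(name, value.strip('"')) for name, value in OPTION_RE.findall(proto_text)]


if __name__ == "__main__":
    for name, value in list_annotations(ANNOTATED_PROTO):
        print(f"meta.{name} = {value}")
```

Message-level annotations use option statements inside the message body, while field-level annotations sit in square brackets after the field number, which is the standard protobuf syntax for custom options.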

Theoretical Basis

Schema-as-Documentation Pattern

The schema-as-documentation pattern is rooted in the principle that metadata is most accurate when it is co-located with the artifact it describes. This principle draws from several software engineering practices:

  1. Code-as-documentation: Just as Javadoc comments live alongside the Java code they document, governance metadata should live alongside schema definitions. This minimizes drift and ensures that schema changes and metadata updates go through the same review process.
  2. Single source of truth: When metadata lives in a separate system, there is always a risk that the schema evolves without corresponding metadata updates. By embedding metadata in the schema itself, the schema file becomes the authoritative source for both structural and governance information.
  3. Declarative governance: Rather than requiring imperative actions (e.g., "after creating a schema, go to the governance portal and set ownership"), governance metadata is declared alongside the schema and automatically ingested.

Custom Protobuf Extensions for Governance Metadata

Protocol Buffers support custom options through the extension mechanism. The meta.proto file defines extensions on standard protobuf option types:

  • FileOptions extensions: Apply to an entire .proto file (e.g., file-level ownership).
  • MessageOptions extensions: Apply to a specific message type (e.g., message-level tags, domains).
  • FieldOptions extensions: Apply to individual fields (e.g., is_primary_key, field-level tags).

These extensions are type-safe: the protobuf compiler validates that annotation values conform to their declared types. For example, an ownership type annotation must be a valid OwnershipType enum value, and a tag annotation must be a string. This catches type errors in governance metadata at compile time, before they reach the ingestion pipeline. It does not check meaning, however: a well-typed value such as a misspelled tag string still compiles.
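
The effect of this compile-time check can be imitated in a few lines. The sketch below is not how protoc works internally; it only illustrates the idea of rejecting values that do not match a declared type. The OwnershipType members, the option names in DECLARED_TYPES, and the validate_annotation helper are hypothetical.

```python
from enum import Enum


class OwnershipType(Enum):
    # Hypothetical members; the real enum is whatever meta.proto declares.
    TECHNICAL_OWNER = 1
    BUSINESS_OWNER = 2


# Declared type of each (hypothetical) option, playing the role of the
# extension declarations in meta.proto.
DECLARED_TYPES = {
    "ownership_type": OwnershipType,
    "tag": str,
    "is_primary_key": bool,
}


def validate_annotation(name, value):
    """Return the typed value, or raise if it does not fit the declared type."""
    if name not in DECLARED_TYPES:
        raise KeyError(f"unknown option: {name}")
    declared = DECLARED_TYPES[name]
    if issubclass(declared, Enum):
        # Enum-typed options accept only the name of a declared member.
        if not isinstance(value, str) or value not in declared.__members__:
            raise ValueError(f"{value!r} is not a valid {declared.__name__}")
        return declared[value]
    if not isinstance(value, declared):
        raise TypeError(
            f"{name} expects {declared.__name__}, got {type(value).__name__}"
        )
    return value
```

Here validate_annotation("ownership_type", "NOT_A_TYPE") raises ValueError and validate_annotation("tag", 42) raises TypeError, while validate_annotation("tag", "PIII") passes because the check is on type, not on meaning.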

Mapping to DataHub Aspects

Each annotation type maps to a specific DataHub aspect:

Annotation           DataHub Aspect              Scope
meta.ownership       Ownership                   Message/File
meta.tag             GlobalTags                  Message/Field
meta.domain          Domains                     Message/File
meta.deprecation     Deprecation                 Message
meta.is_primary_key  SchemaField (isPrimaryKey)  Field

The visitor classes in the DataHub protobuf module (OwnershipVisitor, TagAssociationVisitor, DomainVisitor, DeprecationVisitor) are responsible for extracting these annotations from the compiled descriptor set and transforming them into the corresponding DataHub aspect types.
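
The division of labor among the visitors can be sketched in miniature. The real visitors are Java classes that operate on compiled descriptors; the Python classes below only borrow their names. The Annotation record, the visit method, and the dictionary-shaped aspects are assumptions made for illustration. Each visitor picks out the annotation it understands and emits the aspect listed for it in the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical flattened view of one custom option on a message."""
    name: str   # e.g. "meta.tag"
    value: str


class OwnershipVisitor:
    def visit(self, annotations):
        owners = [a.value for a in annotations if a.name == "meta.ownership"]
        return {"Ownership": owners} if owners else {}


class TagAssociationVisitor:
    def visit(self, annotations):
        tags = [a.value for a in annotations if a.name == "meta.tag"]
        return {"GlobalTags": tags} if tags else {}


class DomainVisitor:
    def visit(self, annotations):
        domains = [a.value for a in annotations if a.name == "meta.domain"]
        return {"Domains": domains} if domains else {}


class DeprecationVisitor:
    def visit(self, annotations):
        notes = [a.value for a in annotations if a.name == "meta.deprecation"]
        if not notes:
            return {}
        # Only the first deprecation note is kept in this sketch.
        return {"Deprecation": {"deprecated": True, "note": notes[0]}}


def extract_aspects(annotations):
    """Run every visitor over the annotations and merge the resulting aspects."""
    visitors = (
        OwnershipVisitor(),
        TagAssociationVisitor(),
        DomainVisitor(),
        DeprecationVisitor(),
    )
    aspects = {}
    for visitor in visitors:
        aspects.update(visitor.visit(annotations))
    return aspects
```

Keeping one visitor per aspect means a new annotation type can be supported by adding a visitor without touching the existing ones.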

Benefits Over External Metadata Management

  • Atomic changes: Schema structure and governance metadata change in the same commit.
  • Code review: Metadata changes are visible in pull request diffs.
  • Versioning: Metadata history is tracked alongside schema history in version control.
  • Automation: No manual steps are required to synchronize metadata after schema changes.
