Principle:Datahub project Datahub Protobuf Annotation
| Field | Value |
|---|---|
| Principle Name | Protobuf_Annotation |
| Category | Governance Metadata |
| Workflow | Protobuf_Schema_Ingestion |
| Repository | https://github.com/datahub-project/datahub |
| Implemented By | Implementation:Datahub_project_Datahub_Meta_Proto_Custom_Options |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
Protobuf Annotation is the principle of embedding governance metadata directly within Protocol Buffer schema definitions using custom protobuf extensions. Rather than maintaining metadata in external systems or configuration files that can drift out of sync with the schemas they describe, this principle advocates for a schema-as-documentation pattern where ownership, tags, domains, deprecation status, and other governance metadata live alongside the field and message definitions they annotate.
This approach leverages the protobuf custom options mechanism -- a first-class language feature that allows users to define typed extension fields on standard protobuf descriptor types (file options, message options, field options). By importing a shared meta.proto definition, schema authors can attach structured governance metadata that is preserved through compilation and subsequently extracted by the DataHub ingestion pipeline.
Usage
Schema authors apply annotations directly in their .proto files by importing meta.proto and using the defined custom options. These annotations are then automatically extracted during the DataHub protobuf ingestion process and mapped to corresponding DataHub aspects.
Common usage patterns include:
- Ownership declaration: Specifying team or individual ownership of a schema at the message or file level.
- Tag attachment: Labeling schemas or individual fields with governance tags (e.g., PII, Confidential).
- Domain classification: Associating schemas with business domains for organizational clarity.
- Deprecation marking: Indicating that a schema or field is deprecated with an explanatory message.
- Primary key identification: Marking fields that serve as primary keys for downstream consumers.
Theoretical Basis
Schema-as-Documentation Pattern
The schema-as-documentation pattern is rooted in the principle that metadata is most accurate when it is co-located with the artifact it describes. This principle draws from several software engineering practices:
- Code-as-documentation: Just as Javadoc comments live alongside the Java code they document, governance metadata should live alongside schema definitions. This minimizes drift and ensures that schema changes and metadata updates go through the same review process.
- Single source of truth: When metadata lives in a separate system, there is always a risk that the schema evolves without corresponding metadata updates. By embedding metadata in the schema itself, the schema file becomes the authoritative source for both structural and governance information.
- Declarative governance: Rather than requiring imperative actions (e.g., "after creating a schema, go to the governance portal and set ownership"), governance metadata is declared alongside the schema and automatically ingested.
Custom Protobuf Extensions for Governance Metadata
Protocol Buffers support custom options through the extension mechanism. The meta.proto file defines extensions on standard protobuf option types:
- FileOptions extensions: Apply to an entire .proto file (e.g., file-level ownership).
- MessageOptions extensions: Apply to a specific message type (e.g., message-level tags, domains).
- FieldOptions extensions: Apply to individual fields (e.g., is_primary_key, field-level tags).
These extensions are type-safe: the protobuf compiler validates that annotation values conform to their declared types. For example, an ownership type annotation must be a valid OwnershipType enum value, and a tag annotation must be a string. This provides compile-time validation of annotation types, so ill-typed annotations are rejected before they reach the ingestion pipeline. It does not check that a value is semantically correct (for example, that a named owner actually exists).
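The scoping and typing rules described above can be illustrated with a small Python model. This is only a sketch of the idea, not the protobuf compiler's logic and not the contents of meta.proto: the option names `ownership_type`, `tag`, and `is_primary_key`, the `OwnershipType` members, and the helper `check_annotation` are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class OwnershipType(Enum):
    # Hypothetical enum members, for illustration only.
    TECHNICAL_OWNER = "TECHNICAL_OWNER"
    BUSINESS_OWNER = "BUSINESS_OWNER"


class Scope(Enum):
    # The three descriptor option types that the extensions attach to.
    FILE = "FileOptions"
    MESSAGE = "MessageOptions"
    FIELD = "FieldOptions"


@dataclass(frozen=True)
class OptionSpec:
    name: str              # hypothetical option name
    value_type: type       # declared type of the option value
    scopes: frozenset      # option types the extension is defined on


# Assumed declarations; a real meta.proto may define different options.
SPECS = {
    spec.name: spec
    for spec in (
        OptionSpec("ownership_type", OwnershipType,
                   frozenset({Scope.FILE, Scope.MESSAGE})),
        OptionSpec("tag", str, frozenset({Scope.MESSAGE, Scope.FIELD})),
        OptionSpec("is_primary_key", bool, frozenset({Scope.FIELD})),
    )
}


def check_annotation(name, value, scope):
    """Return True if the annotation is known, correctly scoped, and well typed.

    Raises KeyError for an unknown option, ValueError for a wrong scope,
    and TypeError for a value that does not match the declared type.
    """
    spec = SPECS.get(name)
    if spec is None:
        raise KeyError(f"unknown option: {name}")
    if scope not in spec.scopes:
        raise ValueError(f"{name} is not defined on {scope.value}")
    if not isinstance(value, spec.value_type):
        raise TypeError(
            f"{name} expects {spec.value_type.__name__}, "
            f"got {type(value).__name__}"
        )
    return True
```

In this model, `check_annotation("tag", "PII", Scope.FIELD)` passes, while `check_annotation("tag", 42, Scope.FIELD)` raises a `TypeError`, which mirrors the kind of rejection the compiler performs on a real schema.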
Mapping to DataHub Aspects
Each annotation type maps to a specific DataHub aspect:
| Annotation | DataHub Aspect | Scope |
|---|---|---|
| meta.ownership | Ownership | Message/File |
| meta.tag | GlobalTags | Message/Field |
| meta.domain | Domains | Message/File |
| meta.deprecation | Deprecation | Message |
| meta.is_primary_key | SchemaField (isPrimaryKey) | Field |
The visitor classes in the DataHub protobuf module (OwnershipVisitor, TagAssociationVisitor, DomainVisitor, DeprecationVisitor) are responsible for extracting these annotations from the compiled descriptor set and transforming them into the corresponding DataHub aspect types.
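The annotation-to-aspect mapping in the table can be sketched as a simple lookup. The Python below is a simplified stand-in for the extraction step, not the visitor classes themselves (those live in the DataHub protobuf module and operate on compiled descriptors). The function `extract_aspects` and the use of a plain dictionary for already-decoded options are assumptions made for illustration.

```python
from collections import defaultdict

# Annotation name -> (DataHub aspect, scopes where it may appear),
# copied from the mapping table in this section.
ASPECT_BY_ANNOTATION = {
    "meta.ownership": ("Ownership", {"Message", "File"}),
    "meta.tag": ("GlobalTags", {"Message", "Field"}),
    "meta.domain": ("Domains", {"Message", "File"}),
    "meta.deprecation": ("Deprecation", {"Message"}),
    "meta.is_primary_key": ("SchemaField (isPrimaryKey)", {"Field"}),
}


def extract_aspects(scope, options):
    """Group annotation values by the DataHub aspect they map to.

    `scope` is "File", "Message", or "Field". `options` is a hypothetical
    dict of option name -> value, standing in for options that have
    already been read from a descriptor. Options that are not governance
    annotations are ignored.
    """
    aspects = defaultdict(list)
    for name, value in options.items():
        if name not in ASPECT_BY_ANNOTATION:
            continue  # not a governance annotation
        aspect, scopes = ASPECT_BY_ANNOTATION[name]
        if scope not in scopes:
            raise ValueError(f"{name} is not valid at {scope} scope")
        aspects[aspect].append(value)
    return dict(aspects)
```

For example, calling `extract_aspects("Message", {"meta.ownership": "data-platform-team", "meta.tag": "PII"})` groups the two values under `Ownership` and `GlobalTags`, and passing `meta.is_primary_key` at message scope raises a `ValueError`. The team name here is made up for the example.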
Benefits Over External Metadata Management
- Atomic changes: Schema structure and governance metadata change in the same commit.
- Code review: Metadata changes are visible in pull request diffs.
- Versioning: Metadata history is tracked alongside schema history in version control.
- Automation: No manual steps required to synchronize metadata after schema changes.