Principle:Datahub project Datahub Protobuf Schema Conversion

Field	Value
Principle Name	Protobuf_Schema_Conversion
Category	Schema Transformation
Workflow	Protobuf_Schema_Ingestion
Repository	https://github.com/datahub-project/datahub
Implemented By	Implementation:Datahub_project_Datahub_ProtobufDataset_Builder
Last Updated	2026-02-09 17:00 GMT

Overview

Description

Protobuf Schema Conversion is the principle governing the transformation of compiled protobuf descriptor sets into DataHub metadata aspects. It is the core transformation step in the protobuf ingestion pipeline: binary descriptor data is parsed into a graph-based type model, traversed using the visitor pattern, and converted into a stream of MetadataChangeProposalWrapper objects. These objects represent DataHub aspects such as SchemaMetadata, Ownership, GlobalTags, Domains, and DatasetProperties.

The principle establishes that schema conversion must be exhaustive (every relevant piece of schema information is captured), structured (using well-defined visitor interfaces), and composable (multiple independent visitors can be combined to produce the full set of metadata aspects).

Usage

This principle is applied at the heart of the protobuf ingestion pipeline, after compilation and before emission. The ProtobufDataset class orchestrates the conversion by:

Parsing the binary descriptor set into a FileDescriptorSet.
Constructing a ProtobufGraph (a directed graph of messages, fields, and type relationships).
Applying a set of visitor classes to the graph to extract metadata.
Producing a stream of MetadataChangeProposalWrapper collections.

Typical scenarios include:

Full schema ingestion: Converting all messages in a proto file into DataHub datasets with complete metadata.
Selective field extraction: Using the SchemaFieldVisitor to produce SchemaField records with type information and comments.
Governance extraction: Using specialized visitors to extract ownership, tags, and domain annotations.

Theoretical Basis

Visitor Pattern for Metadata Extraction

The conversion pipeline uses the Visitor pattern as its primary architectural mechanism. The ProtobufModelVisitor<T> interface defines two visit methods:

visitGraph(VisitContext): Called once per graph traversal, for extracting graph-level (dataset-level) metadata.
visitField(ProtobufField, VisitContext): Called for each field vertex in the graph, for extracting field-level metadata.
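
The shape of such an interface can be illustrated with a minimal, self-contained sketch. The types below (Context, Field, ModelVisitor, CommentVisitor) are hypothetical stand-ins, not the DataHub classes. The sketch also assumes that each hook returns a Stream of results and defaults to an empty stream.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-ins for illustration; these are not the DataHub classes.
final class VisitorSketch {

    /** Stand-in for the traversal context handed to every hook. */
    static final class Context {
        final String rootMessage;
        Context(String rootMessage) { this.rootMessage = rootMessage; }
    }

    /** Stand-in for a field vertex. */
    static final class Field {
        final String name;
        final String comment;
        Field(String name, String comment) { this.name = name; this.comment = comment; }
    }

    /** Two hooks; the empty defaults are an assumption of this sketch. */
    interface ModelVisitor<T> {
        default Stream<T> visitGraph(Context context) { return Stream.empty(); }
        default Stream<T> visitField(Field field, Context context) { return Stream.empty(); }
    }

    /** Emits one graph-level result and one result per visited field. */
    static final class CommentVisitor implements ModelVisitor<String> {
        @Override
        public Stream<String> visitGraph(Context context) {
            return Stream.of("dataset " + context.rootMessage);
        }

        @Override
        public Stream<String> visitField(Field field, Context context) {
            return Stream.of(context.rootMessage + "." + field.name + ": " + field.comment);
        }
    }

    /** Calls the hooks directly, the way a unit test of a single visitor might. */
    static List<String> demo() {
        Context context = new Context("MyMessage");
        CommentVisitor visitor = new CommentVisitor();
        return Stream.concat(
                visitor.visitGraph(context),
                visitor.visitField(new Field("id", "primary key"), context))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Because a visitor in this sketch depends only on the context and the field it is handed, it can be exercised without building a full graph.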

This pattern provides several advantages:

Separation of concerns: Each visitor is responsible for extracting exactly one type of metadata. The OwnershipVisitor extracts ownership, the TagAssociationVisitor extracts tags, and so on. No single class needs to understand the full breadth of DataHub's aspect model.
Composability: Visitors are registered as lists in the DatasetVisitor.Builder and executed in sequence. New metadata types can be supported by adding a new visitor without modifying existing ones.
Testability: Each visitor can be unit tested in isolation against specific proto graph structures.

The DatasetVisitor acts as the composite visitor that aggregates results from sub-visitors into a stream of MetadataChangeProposalWrapper objects. It combines results from:

datasetPropertyVisitors (KafkaTopicPropertyVisitor, PropertyVisitor)
institutionalMemoryMetadataVisitors (InstitutionalMemoryVisitor)
tagAssociationVisitors (TagAssociationVisitor)
termAssociationVisitors (TermAssociationVisitor)
ownershipVisitors (OwnershipVisitor)
domainVisitors (DomainVisitor)
descriptionVisitor (DescriptionVisitor)
deprecationVisitor (DeprecationVisitor)

Graph-Based Type Model

The ProtobufGraph class extends DefaultDirectedGraph from the JGraphT library, modeling the protobuf type system as a directed graph where:

Vertices are ProtobufElement instances: ProtobufMessage, ProtobufField, ProtobufOneOfField, and ProtobufEnum.
Edges are FieldTypeEdge instances encoding the containment and type relationships between messages and their fields.

This graph representation enables:

Root message autodetection: The graph identifies the root message by finding a message vertex with no incoming edges whose children have no other parents.
Path computation: The AllDirectedPaths algorithm computes all paths from the root message to any field, which is used to construct DataHub fieldPath strings (e.g., [type=MyMessage].[type=string].field_name).
Google wrapper flattening: Well-known wrapper types like google.protobuf.StringValue are automatically flattened to their underlying primitive types.
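
These three capabilities can be illustrated without JGraphT using a plain adjacency map. The sketch below is not the ProtobufGraph implementation. GraphSketch and its methods are hypothetical, and root detection is simplified to "a message that no field references". The sketch assumes an acyclic schema, and its path format follows the example above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model; assumes the message graph has no cycles.
final class GraphSketch {

    static final class Field {
        final String name;
        final String type;
        Field(String name, String type) { this.name = name; this.type = type; }
    }

    /** A few well-known wrapper types and the primitives they wrap. */
    static final Map<String, String> WRAPPERS = Map.of(
        "google.protobuf.StringValue", "string",
        "google.protobuf.Int64Value", "int64",
        "google.protobuf.BoolValue", "bool",
        "google.protobuf.DoubleValue", "double");

    /** Message name -> fields; a field whose type is a message name is an edge. */
    private final Map<String, List<Field>> messages = new LinkedHashMap<>();

    GraphSketch message(String name, Field... fields) {
        messages.put(name, List.of(fields));
        return this;
    }

    /** Simplified root detection: the first message with no incoming type edge. */
    String findRoot() {
        Set<String> referenced = new HashSet<>();
        for (List<Field> fields : messages.values()) {
            for (Field field : fields) {
                referenced.add(field.type);
            }
        }
        for (String name : messages.keySet()) {
            if (!referenced.contains(name)) {
                return name;
            }
        }
        throw new IllegalStateException("no root message found");
    }

    /** Depth-first walk from the root, emitting one path per reachable field. */
    List<String> fieldPaths() {
        List<String> paths = new ArrayList<>();
        String root = findRoot();
        walk(root, "[type=" + root + "]", paths);
        return paths;
    }

    private void walk(String message, String prefix, List<String> paths) {
        for (Field field : messages.get(message)) {
            // Wrapper flattening: report the wrapped primitive, not the wrapper.
            String type = WRAPPERS.getOrDefault(field.type, field.type);
            String path = prefix + ".[type=" + type + "]." + field.name;
            paths.add(path);
            if (messages.containsKey(type)) {
                walk(type, path, paths);
            }
        }
    }

    static GraphSketch example() {
        return new GraphSketch()
            .message("Customer", new Field("email", "google.protobuf.StringValue"))
            .message("Order", new Field("id", "int64"), new Field("customer", "Customer"));
    }

    public static void main(String[] args) {
        example().fieldPaths().forEach(System.out::println);
    }
}
```

In this example, Order is detected as the root because Customer is referenced by the customer field, and the email field is reported with type string rather than google.protobuf.StringValue.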

DFS Traversal for Exhaustive Analysis

The graph's accept method performs a traversal that visits every vertex, ensuring no schema element is missed. The traversal combines:

Graph-level visits: Each visitor's visitGraph method is called first, allowing extraction of message-level metadata.
Vertex-level visits: Each vertex dispatches to the visitors through the polymorphic ProtobufElement.accept method; field vertices are routed to the visitor's visitField method.

This two-phase traversal ensures that both dataset-level aspects (ownership, tags, domains) and field-level aspects (schema fields with types, descriptions, and paths) are fully captured.
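
A two-phase accept method of this kind might look like the sketch below. It is an illustration, not the ProtobufGraph code: Graph, Visitor, and the string-valued fields are hypothetical stand-ins. The ordering shown (all graph-level results first, then field-level results in vertex order) is an assumption based on the description above.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-ins; vertices are reduced to field names for brevity.
final class TraversalSketch {

    interface Visitor<T> {
        default Stream<T> visitGraph(Graph graph) { return Stream.empty(); }
        default Stream<T> visitField(String field, Graph graph) { return Stream.empty(); }
    }

    static final class Graph {
        final String rootMessage;
        final List<String> fields;

        Graph(String rootMessage, List<String> fields) {
            this.rootMessage = rootMessage;
            this.fields = fields;
        }

        /** Phase 1: every visitor sees the graph. Phase 2: every visitor sees every field. */
        <T> Stream<T> accept(List<Visitor<T>> visitors) {
            Stream<T> graphLevel = visitors.stream().flatMap(v -> v.visitGraph(this));
            Stream<T> fieldLevel = fields.stream()
                .flatMap(field -> visitors.stream().flatMap(v -> v.visitField(field, this)));
            return Stream.concat(graphLevel, fieldLevel);
        }
    }

    static List<String> demo() {
        // Dataset-level visitor: contributes only in the graph phase.
        Visitor<String> owner = new Visitor<String>() {
            @Override
            public Stream<String> visitGraph(Graph graph) {
                return Stream.of("owner(" + graph.rootMessage + ")");
            }
        };
        // Field-level visitor: contributes only in the vertex phase.
        Visitor<String> schema = new Visitor<String>() {
            @Override
            public Stream<String> visitField(String field, Graph graph) {
                return Stream.of("field(" + field + ")");
            }
        };
        Graph graph = new Graph("Order", List.of("id", "customer"));
        List<Visitor<String>> visitors = List.of(schema, owner);
        return graph.accept(visitors).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Even though the field-level visitor is registered first here, the graph-level result is emitted before any field result, because the two phases are concatenated in a fixed order.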
