Principle:Datahub project Datahub Protobuf Schema Conversion
| Field | Value |
|---|---|
| Principle Name | Protobuf_Schema_Conversion |
| Category | Schema Transformation |
| Workflow | Protobuf_Schema_Ingestion |
| Repository | https://github.com/datahub-project/datahub |
| Implemented By | Implementation:Datahub_project_Datahub_ProtobufDataset_Builder |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
Protobuf Schema Conversion is the principle governing the transformation of compiled protobuf descriptor sets into rich DataHub metadata aspects. This is the core transformation step in the protobuf ingestion pipeline, where binary descriptor data is parsed into a graph-based type model, traversed using the visitor pattern, and converted into a stream of MetadataChangeProposalWrapper objects representing DataHub aspects such as SchemaMetadata, Ownership, GlobalTags, Domains, DatasetProperties, and more.
The principle establishes that schema conversion must be exhaustive (every relevant piece of schema information is captured), structured (using well-defined visitor interfaces), and composable (multiple independent visitors can be combined to produce the full set of metadata aspects).
Usage
This principle is applied at the heart of the protobuf ingestion pipeline, after compilation and before emission. The ProtobufDataset class orchestrates the conversion by:
- Parsing the binary descriptor set into a FileDescriptorSet.
- Constructing a ProtobufGraph (a directed graph of messages, fields, and type relationships).
- Applying a set of visitor classes to the graph to extract metadata.
- Producing a stream of MetadataChangeProposalWrapper collections.
Typical scenarios include:
- Full schema ingestion: Converting all messages in a proto file into DataHub datasets with complete metadata.
- Selective field extraction: Using the SchemaFieldVisitor to produce SchemaField records with type information and comments.
- Governance extraction: Using specialized visitors to extract ownership, tags, and domain annotations.
Theoretical Basis
Visitor Pattern for Metadata Extraction
The conversion pipeline uses the Visitor pattern as its primary architectural mechanism. The ProtobufModelVisitor<T> interface defines two visit methods:
- visitGraph(VisitContext): Called once per graph traversal, for extracting graph-level (dataset-level) metadata.
- visitField(ProtobufField, VisitContext): Called for each field vertex in the graph, for extracting field-level metadata.
This pattern provides several advantages:
- Separation of concerns: Each visitor is responsible for extracting exactly one type of metadata. The OwnershipVisitor extracts ownership, the TagAssociationVisitor extracts tags, and so on. No single class needs to understand the full breadth of DataHub's aspect model.
- Composability: Visitors are registered as lists in the DatasetVisitor.Builder and executed in sequence. New metadata types can be supported by adding a new visitor without modifying existing ones.
- Testability: Each visitor can be unit tested in isolation against specific proto graph structures.
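The two-method interface and the single-responsibility visitors described above can be sketched as follows. This is a minimal illustration using only the Java standard library: VisitContext and ProtobufField are reduced to hypothetical stand-ins, and FieldCommentVisitor and DatasetNameVisitor are invented for the example rather than taken from DataHub.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Every type below is a simplified stand-in, not the DataHub class of the same name.
class VisitorSketch {

  /** Stand-in for the traversal context; here it only carries the dataset name. */
  static final class VisitContext {
    final String datasetName;
    VisitContext(String datasetName) { this.datasetName = datasetName; }
  }

  /** Stand-in for a field vertex. */
  static final class ProtobufField {
    final String name;
    final String comment;
    ProtobufField(String name, String comment) {
      this.name = name;
      this.comment = comment;
    }
  }

  /** Two hooks; both default to "nothing to report" so a visitor overrides only what it needs. */
  interface ProtobufModelVisitor<T> {
    default Stream<T> visitGraph(VisitContext context) { return Stream.empty(); }
    default Stream<T> visitField(ProtobufField field, VisitContext context) { return Stream.empty(); }
  }

  /** Hypothetical visitor that extracts field comments and nothing else. */
  static final class FieldCommentVisitor implements ProtobufModelVisitor<String> {
    @Override
    public Stream<String> visitField(ProtobufField field, VisitContext context) {
      return Stream.of(context.datasetName + "." + field.name + ": " + field.comment);
    }
  }

  /** Hypothetical visitor that extracts one dataset-level value and ignores fields. */
  static final class DatasetNameVisitor implements ProtobufModelVisitor<String> {
    @Override
    public Stream<String> visitGraph(VisitContext context) {
      return Stream.of("dataset=" + context.datasetName);
    }
  }

  /** Calls both hooks on both visitors; each visitor contributes only its own metadata. */
  static List<String> demo() {
    VisitContext ctx = new VisitContext("MyMessage");
    ProtobufField id = new ProtobufField("id", "primary key");
    ProtobufModelVisitor<String> name = new DatasetNameVisitor();
    ProtobufModelVisitor<String> comments = new FieldCommentVisitor();
    return Stream.of(
            name.visitGraph(ctx), name.visitField(id, ctx),
            comments.visitGraph(ctx), comments.visitField(id, ctx))
        .flatMap(s -> s)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    demo().forEach(System.out::println);
  }
}
```

Because each visitor is a small class with no shared state, it can be exercised directly against a hand-built field or context, which is the testability property listed above.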
The DatasetVisitor acts as the composite visitor that aggregates results from sub-visitors into a stream of MetadataChangeProposalWrapper objects. It combines results from:
- datasetPropertyVisitors (KafkaTopicPropertyVisitor, PropertyVisitor)
- institutionalMemoryMetadataVisitors (InstitutionalMemoryVisitor)
- tagAssociationVisitors (TagAssociationVisitor)
- termAssociationVisitors (TermAssociationVisitor)
- ownershipVisitors (OwnershipVisitor)
- domainVisitors (DomainVisitor)
- descriptionVisitor (DescriptionVisitor)
- deprecationVisitor (DeprecationVisitor)
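As a rough sketch of this aggregation, the example below groups sub-visitors into lists and merges each group's output into one aspect-like record. It assumes a much simpler shape than the real class: GraphVisitor and Proposal are hypothetical stand-ins (Proposal plays the role of MetadataChangeProposalWrapper), only two of the visitor groups are modeled, and a plain constructor replaces DatasetVisitor.Builder.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Simplified stand-ins only; this is not the DataHub DatasetVisitor.
class CompositeVisitorSketch {

  /** Hypothetical sub-visitor: yields zero or more values for a dataset. */
  interface GraphVisitor {
    Stream<String> visitGraph(String datasetName);
  }

  /** Stand-in for a MetadataChangeProposalWrapper: an aspect name plus its values. */
  static final class Proposal {
    final String aspectName;
    final List<String> values;
    Proposal(String aspectName, List<String> values) {
      this.aspectName = aspectName;
      this.values = values;
    }
    @Override
    public String toString() { return aspectName + "=" + values; }
  }

  /** Composite: holds one list of sub-visitors per aspect and merges their results. */
  static final class DatasetVisitor {
    private final List<GraphVisitor> tagVisitors;
    private final List<GraphVisitor> ownershipVisitors;

    DatasetVisitor(List<GraphVisitor> tagVisitors, List<GraphVisitor> ownershipVisitors) {
      this.tagVisitors = tagVisitors;
      this.ownershipVisitors = ownershipVisitors;
    }

    Stream<Proposal> visitGraph(String datasetName) {
      return Stream.of(
          collect("globalTags", tagVisitors, datasetName),
          collect("ownership", ownershipVisitors, datasetName));
    }

    /** Runs every visitor in the group, in registration order, into one proposal. */
    private static Proposal collect(String aspect, List<GraphVisitor> visitors, String datasetName) {
      List<String> values = visitors.stream()
          .flatMap(v -> v.visitGraph(datasetName))
          .collect(Collectors.toList());
      return new Proposal(aspect, values);
    }
  }

  static List<String> demo() {
    // Two independent tag sources and one ownership source, all hypothetical.
    GraphVisitor optionTags = name -> Stream.of("urn:li:tag:pii");
    GraphVisitor nameTags = name -> Stream.of("urn:li:tag:" + name.toLowerCase());
    GraphVisitor owners = name -> Stream.of("urn:li:corpGroup:data-platform");
    DatasetVisitor visitor =
        new DatasetVisitor(Arrays.asList(optionTags, nameTags), Arrays.asList(owners));
    return visitor.visitGraph("Orders").map(Proposal::toString).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    demo().forEach(System.out::println);
  }
}
```

Adding a third tag source in this sketch means appending one more element to the tag list; the composite and the existing visitors stay unchanged.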
Graph-Based Type Model
The ProtobufGraph class extends DefaultDirectedGraph from the JGraphT library, modeling the protobuf type system as a directed graph where:
- Vertices are ProtobufElement instances: ProtobufMessage, ProtobufField, ProtobufOneOfField, and ProtobufEnum.
- Edges are FieldTypeEdge instances encoding the containment and type relationships between messages and their fields.
This graph representation enables:
- Root message autodetection: The graph identifies the root message by finding a message vertex with no incoming edges whose children have no other parents.
- Path computation: The AllDirectedPaths algorithm computes all paths from the root message to any field, which is used to construct DataHub fieldPath strings (e.g., [type=MyMessage].[type=string].field_name).
- Google wrapper flattening: Well-known wrapper types like google.protobuf.StringValue are automatically flattened to their underlying primitive types.
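The three capabilities above can be illustrated with a small standard-library sketch. It makes several assumptions: a plain adjacency map replaces the JGraphT graph, a backtracking search replaces AllDirectedPaths, the root rule is reduced to "the one vertex with no incoming edges", and fieldPath only reproduces the single-level shape of the example string. All class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stdlib-only sketch; the real ProtobufGraph builds on JGraphT instead.
class ProtoGraphSketch {
  // Adjacency list: vertex -> vertices it points to.
  private final Map<String, List<String>> children = new LinkedHashMap<>();
  // Number of incoming edges per vertex.
  private final Map<String, Integer> inDegree = new LinkedHashMap<>();

  // A few well-known wrapper messages and the primitive each one wraps.
  private static final Map<String, String> WRAPPERS = new HashMap<>();
  static {
    WRAPPERS.put("google.protobuf.StringValue", "string");
    WRAPPERS.put("google.protobuf.BoolValue", "bool");
    WRAPPERS.put("google.protobuf.Int32Value", "int32");
    WRAPPERS.put("google.protobuf.Int64Value", "int64");
    WRAPPERS.put("google.protobuf.DoubleValue", "double");
  }

  void addEdge(String from, String to) {
    children.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    children.computeIfAbsent(to, k -> new ArrayList<>());
    inDegree.putIfAbsent(from, 0);
    inDegree.merge(to, 1, Integer::sum);
  }

  /** Simplified root rule (an assumption): the single vertex with no incoming edges. */
  String root() {
    List<String> roots = new ArrayList<>();
    for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
      if (e.getValue() == 0) roots.add(e.getKey());
    }
    if (roots.size() != 1) throw new IllegalStateException("expected one root, found " + roots);
    return roots.get(0);
  }

  /** Every simple directed path from one vertex to another, found by backtracking. */
  List<List<String>> allPaths(String from, String to) {
    List<List<String>> result = new ArrayList<>();
    walk(from, to, new ArrayDeque<String>(), result);
    return result;
  }

  private void walk(String current, String target, Deque<String> path, List<List<String>> out) {
    path.addLast(current);
    if (current.equals(target)) {
      out.add(new ArrayList<>(path));
    } else {
      for (String next : children.getOrDefault(current, Collections.<String>emptyList())) {
        if (!path.contains(next)) walk(next, target, path, out); // skip vertices already on the path
      }
    }
    path.removeLast();
  }

  /** Replaces a wrapper message name with its primitive; other names pass through. */
  static String flatten(String typeName) {
    return WRAPPERS.getOrDefault(typeName, typeName);
  }

  /** Formats only the single-level shape shown in the example above. */
  static String fieldPath(String message, String fieldType, String fieldName) {
    return "[type=" + message + "].[type=" + flatten(fieldType) + "]." + fieldName;
  }

  /** Hypothetical schema: Order reaches Customer through two different fields. */
  static ProtoGraphSketch sample() {
    ProtoGraphSketch g = new ProtoGraphSketch();
    g.addEdge("Order", "Order.id");
    g.addEdge("Order", "Order.customer");
    g.addEdge("Order", "Order.billing");
    g.addEdge("Order.customer", "Customer");
    g.addEdge("Order.billing", "Customer");
    g.addEdge("Customer", "Customer.email");
    return g;
  }

  public static void main(String[] args) {
    ProtoGraphSketch g = sample();
    System.out.println(g.root());
    System.out.println(g.allPaths(g.root(), "Customer.email"));
    System.out.println(fieldPath("MyMessage", "google.protobuf.StringValue", "field_name"));
  }
}
```

In the sample schema the Customer message is reachable through two fields, so the same leaf field yields two distinct paths from the root, which is why all paths are enumerated rather than only the first one found.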
DFS Traversal for Exhaustive Analysis
The graph's accept method performs a traversal that visits every vertex, ensuring no schema element is missed. The traversal combines:
- Graph-level visits: Each visitor's visitGraph method is called first, allowing extraction of message-level metadata.
- Vertex-level visits: Each vertex delegates to the visitor's visitField method through the ProtobufElement.accept polymorphic dispatch.
This two-phase traversal ensures that both dataset-level aspects (ownership, tags, domains) and field-level aspects (schema fields with types, descriptions, and paths) are fully captured.
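A minimal sketch of the two-phase shape, under stated assumptions: the types below are hypothetical stand-ins, vertices are visited in plain list order rather than in the order the real graph would produce, and message vertices contribute nothing because only the two visit methods described earlier are modeled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Simplified stand-ins only; not the DataHub ProtobufGraph or ProtobufElement.
class TraversalSketch {

  interface Visitor<T> {
    default Stream<T> visitGraph(Graph graph) { return Stream.empty(); }
    default Stream<T> visitField(Field field) { return Stream.empty(); }
  }

  /** Each vertex type decides which visitor hook it triggers. */
  interface Element {
    <T> Stream<T> accept(Visitor<T> visitor);
  }

  static final class Message implements Element {
    final String name;
    Message(String name) { this.name = name; }
    @Override
    public <T> Stream<T> accept(Visitor<T> visitor) {
      return Stream.empty(); // assumption: no message-level hook in this sketch
    }
  }

  static final class Field implements Element {
    final String name;
    Field(String name) { this.name = name; }
    @Override
    public <T> Stream<T> accept(Visitor<T> visitor) {
      return visitor.visitField(this); // polymorphic dispatch to the field hook
    }
  }

  static final class Graph {
    final String rootName;
    final List<Element> vertices;
    Graph(String rootName, List<Element> vertices) {
      this.rootName = rootName;
      this.vertices = vertices;
    }

    /** Phase 1: graph-level visit. Phase 2: every vertex accepts the visitor. */
    <T> Stream<T> accept(Visitor<T> visitor) {
      Stream<T> graphLevel = visitor.visitGraph(this);
      Stream<T> vertexLevel = vertices.stream().flatMap(v -> v.accept(visitor));
      return Stream.concat(graphLevel, vertexLevel);
    }
  }

  static List<String> demo() {
    Graph graph = new Graph("Order",
        Arrays.<Element>asList(new Message("Order"), new Field("id"), new Field("total")));
    // Tracing visitor that records which hook fired for which element.
    Visitor<String> tracer = new Visitor<String>() {
      @Override
      public Stream<String> visitGraph(Graph g) { return Stream.of("graph:" + g.rootName); }
      @Override
      public Stream<String> visitField(Field f) { return Stream.of("field:" + f.name); }
    };
    return graph.accept(tracer).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    demo().forEach(System.out::println);
  }
}
```

Concatenating the graph-level stream ahead of the vertex-level stream keeps dataset-level results first in the output while still touching every vertex exactly once.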