Implementation:Datahub project Datahub AvroSchemaConverter
| Knowledge Sources | |
|---|---|
| Domains | Schema_Conversion |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Converts Apache Avro schemas into DataHub's SchemaMetadata format, implementing the SchemaConverter<Schema> interface and following the SchemaFieldPath Specification V2 for field path generation.
Description
AvroSchemaConverter is a Lombok @Builder class that transforms Avro Schema objects into DataHub SchemaMetadata with properly typed SchemaField entries. It handles the full spectrum of Avro types:
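Primitive types: BOOLEAN, INT, LONG, FLOAT, DOUBLE, STRING, and BYTES, plus FIXED (a named type in the Avro specification, listed here because it maps to a single DataHub type without recursion) -- mapped to the corresponding DataHub types (BooleanType, NumberType, StringType, BytesType, FixedType).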
Complex types:
- RECORD -- Mapped to RecordType. Fields are recursively processed. Cyclic references are detected via a visitedRecords set to prevent infinite recursion.
- ARRAY -- Mapped to ArrayType. For complex element types (records, arrays, maps, unions), the element is recursively processed.
- MAP -- Mapped to MapType. Value types are recursively processed if complex.
- UNION -- If the union is a simple nullable type (two members, one null), it unwraps to the non-null type with nullable=true. Otherwise, it creates a UnionType and processes each non-null member as a sub-field.
- ENUM -- Mapped to EnumType with allowed symbols appended to the description.
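The union and record rules above can be illustrated without the Avro library. The following is a minimal sketch, not the converter's actual code: the Node type, the Kind enum, and all method names are hypothetical stand-ins for org.apache.avro.Schema and the converter's private methods. It also assumes the visited set is never pruned, so a record type is descended into only once per traversal; the real converter may scope the set differently.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical miniature schema model; stands in for org.apache.avro.Schema. */
final class UnionAndCycleSketch {

    enum Kind { NULL, STRING, INT, RECORD, UNION }

    /** A schema node: records carry a name and child schemas, unions carry members. */
    static final class Node {
        final Kind kind;
        final String name; // only meaningful for RECORD
        final List<Node> children = new ArrayList<>();

        Node(Kind kind, String name) {
            this.kind = kind;
            this.name = name;
        }
    }

    /** True when the union has exactly two members and exactly one of them is NULL. */
    static boolean isSimpleNullable(Node union) {
        if (union.kind != Kind.UNION || union.children.size() != 2) {
            return false;
        }
        long nulls = union.children.stream().filter(c -> c.kind == Kind.NULL).count();
        return nulls == 1;
    }

    /** Returns the first non-null member; intended for simple nullable unions. */
    static Node unwrap(Node union) {
        for (Node member : union.children) {
            if (member.kind != Kind.NULL) {
                return member;
            }
        }
        throw new IllegalArgumentException("union has no non-null member");
    }

    /** Counts records reached; a record name already in 'visited' is not entered again. */
    static int countRecords(Node node, Set<String> visited) {
        int total = 0;
        if (node.kind == Kind.RECORD) {
            if (!visited.add(node.name)) {
                return 0; // already processed: stop here instead of recursing forever
            }
            total = 1;
        }
        for (Node child : node.children) {
            total += countRecords(child, visited);
        }
        return total;
    }

    public static void main(String[] args) {
        // A record whose only field is union[null, itself], i.e. a cyclic schema.
        Node record = new Node(Kind.RECORD, "A");
        Node selfRef = new Node(Kind.UNION, null);
        selfRef.children.add(new Node(Kind.NULL, null));
        selfRef.children.add(record);
        record.children.add(selfRef);

        System.out.println(isSimpleNullable(selfRef));
        System.out.println(countRecords(record, new HashSet<>()));
    }
}
```

Tracking record names in a set keeps the traversal finite for self-referential schemas such as linked lists or trees, at the cost of not expanding a record that legitimately appears in two places.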
Logical types (date, time-micros, time-millis, timestamp-micros, timestamp-millis, decimal, uuid) are mapped via a static lookup table to their semantic DataHub equivalents (DateType, TimeType, NumberType, StringType).
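A static lookup of this kind might look like the sketch below. The class and method names are hypothetical, DataHub types are represented as plain strings to keep the example self-contained, and the individual pairings (in particular mapping the timestamp variants to TimeType) are assumptions inferred from the list above rather than taken from the source file.

```java
import java.util.Map;

/** Hypothetical lookup from Avro logical type names to DataHub type names. */
final class LogicalTypeLookupSketch {

    // Pairings are assumptions for illustration; check the converter for the real table.
    static final Map<String, String> LOGICAL_TYPES = Map.of(
        "date", "DateType",
        "time-millis", "TimeType",
        "time-micros", "TimeType",
        "timestamp-millis", "TimeType",
        "timestamp-micros", "TimeType",
        "decimal", "NumberType",
        "uuid", "StringType");

    /** Returns the mapped type, or the fallback derived from the underlying Avro type. */
    static String resolve(String logicalType, String fallback) {
        if (logicalType == null) {
            return fallback; // Map.of rejects null keys, so guard explicitly
        }
        return LOGICAL_TYPES.getOrDefault(logicalType, fallback);
    }

    public static void main(String[] args) {
        // A decimal is physically BYTES or FIXED in Avro but is semantically a number.
        System.out.println(resolve("decimal", "BytesType"));
        // No logical type: keep whatever the physical type maps to.
        System.out.println(resolve(null, "StringType"));
    }
}
```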
Schema fingerprinting uses Avro's SchemaNormalization to compute an MD5 hash of the schema for the hash field.
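In the Avro library, SchemaNormalization.toParsingForm(schema) produces the canonical text of a schema and SchemaNormalization.parsingFingerprint("MD5", schema) returns the digest bytes. The sketch below reproduces only the hashing step with the JDK so that it runs without Avro on the classpath. The class name is hypothetical, the input string stands in for the canonical form, and hex encoding of the digest is an assumption about how the hash field is rendered.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: MD5 over a canonical schema string, rendered as lowercase hex. */
final class Md5FingerprintSketch {

    static String md5Hex(String canonicalSchema) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(canonicalSchema.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the digests every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Well-known MD5 test vector for the input "abc".
        System.out.println(md5Hex("abc")); // 900150983cd24fb0d6963f7d28e17f72
    }
}
```

Hashing the normalized form rather than the raw text means two schemas that differ only in whitespace, field documentation, or attribute order yield the same fingerprint.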
Field properties -- JSON properties from both the field and its schema are serialized and stored in jsonProps. Field documentation and default values are combined into the description.
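How the description is assembled is not spelled out here, so the following is only a sketch of one plausible approach. The class name, the newline separator, and the "Default:" label are all invented for illustration and should not be read as the converter's actual output format.

```java
/** Hypothetical helper that merges a field's doc string and default value. */
final class DescriptionSketch {

    /** Joins doc and default value; separator and label are made up for this example. */
    static String describe(String doc, String defaultValue) {
        StringBuilder sb = new StringBuilder();
        if (doc != null && !doc.isEmpty()) {
            sb.append(doc);
        }
        if (defaultValue != null) {
            if (sb.length() > 0) {
                sb.append("\n"); // keep the default on its own line after the doc
            }
            sb.append("Default: ").append(defaultValue);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("Unique user identifier", "0"));
    }
}
```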
Usage
Use AvroSchemaConverter in the datahub-schematron library when you need to convert Avro schemas (from Kafka Schema Registry, Avro files, or Hive metastore) into DataHub schema metadata. It is the primary Avro schema processing component in the schematron converter pipeline.
Code Reference
Source Location
- Repository: Datahub_project_Datahub
- File: metadata-integration/java/datahub-schematron/lib/src/main/java/io/datahubproject/schematron/converters/avro/AvroSchemaConverter.java
Signature
@Slf4j
@Builder
public class AvroSchemaConverter implements SchemaConverter<Schema> {
    @Override
    public SchemaMetadata toDataHubSchema(
        Schema schema,
        boolean isKeySchema,
        boolean defaultNullable,
        DataPlatformUrn platformUrn,
        String rawSchemaString);

    // Internal processing methods
    private void processSchema(Schema schema, FieldPath fieldPath, boolean defaultNullable, List<SchemaField> fields, Set<String> visitedRecords);
    private void processField(Schema.Field field, FieldPath fieldPath, boolean defaultNullable, List<SchemaField> fields, ...);
    private void processRecordField(...);
    private void processArrayField(...);
    private void processMapField(...);
    private void processUnionField(...);
    private void processEnumField(...);
    private void processPrimitiveField(...);
}
Import
import io.datahubproject.schematron.converters.avro.AvroSchemaConverter;
I/O Contract
| Input | Type | Description |
|---|---|---|
| schema | org.apache.avro.Schema | The Avro schema to convert |
| isKeySchema | boolean | Whether this is a key schema (for Kafka) |
| defaultNullable | boolean | Default nullability for fields |
| platformUrn | DataPlatformUrn | Target data platform URN |
| rawSchemaString | String | Optional raw schema string for fingerprinting |

| Output | Type | Description |
|---|---|---|
| SchemaMetadata | SchemaMetadata | DataHub schema with fields, hash, platform schema, and field paths following V2 spec |
Usage Examples
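import com.linkedin.common.urn.DataPlatformUrn;
import com.linkedin.schema.SchemaMetadata;
import io.datahubproject.schematron.converters.avro.AvroSchemaConverter;
import org.apache.avro.Schema;

// avroSchemaJson is assumed to be a String holding the Avro schema as JSON
AvroSchemaConverter converter = AvroSchemaConverter.builder().build();
Schema avroSchema = new Schema.Parser().parse(avroSchemaJson);
SchemaMetadata metadata = converter.toDataHubSchema(
    avroSchema,
    false,                       // not a key schema
    false,                       // fields are not nullable by default
    new DataPlatformUrn("kafka"),
    avroSchemaJson               // raw schema for fingerprinting
);
// metadata.getFields() contains typed SchemaField entries
// metadata.getHash() contains the MD5 fingerprint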