Implementation:Datahub project Datahub AvroSchemaConverter

From Leeroopedia


Knowledge Sources
Domains: Schema_Conversion
Last Updated: 2026-02-10 00:00 GMT

Overview

Converts Apache Avro schemas into DataHub's SchemaMetadata format, implementing the SchemaConverter<Schema> interface and following the SchemaFieldPath Specification V2 for field path generation.

Description

AvroSchemaConverter is a Lombok @Builder class that transforms Avro Schema objects into DataHub SchemaMetadata with properly typed SchemaField entries. It handles the full spectrum of Avro types:

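Primitive types: BOOLEAN, INT, LONG, FLOAT, DOUBLE, STRING, BYTES, plus FIXED (a named type in the Avro specification, listed here because it is handled alongside the primitives) -- mapped to the corresponding DataHub types (BooleanType, NumberType, StringType, BytesType, FixedType).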

Complex types:

  • RECORD -- Mapped to RecordType. Fields are recursively processed. Cyclic references are detected via a visitedRecords set to prevent infinite recursion.
  • ARRAY -- Mapped to ArrayType. For complex element types (records, arrays, maps, unions), the element is recursively processed.
  • MAP -- Mapped to MapType. Value types are recursively processed if complex.
  • UNION -- If the union is a simple nullable type (two members, one null), it unwraps to the non-null type with nullable=true. Otherwise, it creates a UnionType and processes each non-null member as a sub-field.
  • ENUM -- Mapped to EnumType with allowed symbols appended to the description.
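The RECORD and UNION rules above can be sketched without the Avro library. This is a minimal illustration, not the converter's actual code: ToyField, the string-based type names, the dotted paths (which are not the V2 field path format), and the choice to keep a record in visitedRecords for the whole traversal are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RecordTraversalSketch {

    /** Hypothetical field model: a name plus its type branches (several branches = a union). */
    static final class ToyField {
        final String name;
        final List<String> branches;

        ToyField(String name, String... branches) {
            this.name = name;
            this.branches = Arrays.asList(branches);
        }
    }

    /** True only for a two-member union in which exactly one member is "null". */
    static boolean isSimpleNullable(List<String> branches) {
        return branches.size() == 2
            && (branches.get(0).equals("null") != branches.get(1).equals("null"));
    }

    /** Walks a record by name and appends one "path: type" line per field to out. */
    static void walk(String recordName, String path, Map<String, List<ToyField>> registry,
                     Set<String> visitedRecords, List<String> out) {
        if (!visitedRecords.add(recordName)) {
            return; // record already visited: stop instead of recursing forever
        }
        for (ToyField field : registry.get(recordName)) {
            String fieldPath = path.isEmpty() ? field.name : path + "." + field.name;
            boolean nullable = isSimpleNullable(field.branches);
            String type;
            if (nullable) {
                // Unwrap ["null", T] or [T, "null"] to T and remember the nullability.
                type = field.branches.get(0).equals("null")
                    ? field.branches.get(1)
                    : field.branches.get(0);
            } else if (field.branches.size() == 1) {
                type = field.branches.get(0);
            } else {
                type = "union"; // member-by-member processing is omitted in this sketch
            }
            out.add(fieldPath + ": " + type + (nullable ? " (nullable)" : ""));
            if (registry.containsKey(type)) {
                walk(type, fieldPath, registry, visitedRecords, out); // nested record
            }
        }
    }

    /** A self-referencing linked-list node: Node { value: int, next: ["null", "Node"] }. */
    static List<String> demo() {
        Map<String, List<ToyField>> registry = new HashMap<>();
        registry.put("Node", Arrays.asList(
            new ToyField("value", "int"),
            new ToyField("next", "null", "Node")));
        List<String> out = new ArrayList<>();
        walk("Node", "", registry, new HashSet<String>(), out);
        return out;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

For the self-referencing Node record, the walk emits the two fields once and returns as soon as it meets Node a second time, so the recursion terminates.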

Logical types (date, time-micros, time-millis, timestamp-micros, timestamp-millis, decimal, uuid) are mapped via a static lookup table to their semantic DataHub equivalents (DateType, TimeType, NumberType, StringType).
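A static lookup table of this kind might look like the sketch below. The source lists the logical type names and the target types but not which name maps to which type, so the individual pairings here are assumptions, as are the class name, the use of plain strings in place of DataHub type classes, and the fallback behavior for unknown names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class LogicalTypeLookupSketch {

    // Assumed pairings: the source names both sides but not the exact mapping.
    private static final Map<String, String> LOGICAL_TYPE_TO_DATAHUB_TYPE;

    static {
        Map<String, String> m = new HashMap<>();
        m.put("date", "DateType");
        m.put("time-micros", "TimeType");
        m.put("time-millis", "TimeType");
        m.put("timestamp-micros", "TimeType");
        m.put("timestamp-millis", "TimeType");
        m.put("decimal", "NumberType");
        m.put("uuid", "StringType");
        LOGICAL_TYPE_TO_DATAHUB_TYPE = Collections.unmodifiableMap(m);
    }

    /**
     * Returns the semantic type for a logical type name, or fallbackType
     * (the type derived from the underlying Avro type) when the name is
     * absent or not in the table.
     */
    static String resolve(String logicalTypeName, String fallbackType) {
        if (logicalTypeName == null) {
            return fallbackType;
        }
        return LOGICAL_TYPE_TO_DATAHUB_TYPE.getOrDefault(logicalTypeName, fallbackType);
    }

    public static void main(String[] args) {
        System.out.println(resolve("timestamp-millis", "NumberType"));
        System.out.println(resolve(null, "StringType"));
    }
}
```

Keeping the table static and immutable means the lookup costs one hash probe per field and cannot be altered by callers.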

Schema fingerprinting uses Avro's SchemaNormalization to compute an MD5 hash of the schema for the hash field.
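The hashing step can be illustrated with the JDK alone. The sketch below is not the converter's code: it hashes an already-normalized schema string with MessageDigest, whereas Avro exposes the combined normalize-and-hash step as SchemaNormalization.parsingFingerprint. The class name, the md5Hex helper, and the hex encoding of the result are assumptions for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class SchemaHashSketch {

    /** Returns the lowercase hex MD5 digest of a (hypothetically pre-normalized) schema string. */
    static String md5Hex(String canonicalSchema) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(canonicalSchema.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    public static void main(String[] args) {
        String canonical = "{\"name\":\"User\",\"type\":\"record\",\"fields\":[]}";
        System.out.println(md5Hex(canonical));
    }
}
```

Hashing a normalized form rather than the raw text means two schema strings that differ only in whitespace or attribute order produce the same fingerprint, provided the normalization step removes those differences first.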

Field properties -- JSON properties from both the field and its schema are serialized and stored in jsonProps. Field documentation and default values are combined into the description.

Usage

Use AvroSchemaConverter in the datahub-schematron library when you need to convert Avro schemas (from Kafka Schema Registry, Avro files, or Hive metastore) into DataHub schema metadata. It is the primary Avro schema processing component in the schematron converter pipeline.

Code Reference

Source Location

Signature

@Slf4j
@Builder
public class AvroSchemaConverter implements SchemaConverter<Schema> {

    @Override
    public SchemaMetadata toDataHubSchema(
        Schema schema,
        boolean isKeySchema,
        boolean defaultNullable,
        DataPlatformUrn platformUrn,
        String rawSchemaString);

    // Internal processing methods
    private void processSchema(Schema schema, FieldPath fieldPath, boolean defaultNullable, List<SchemaField> fields, Set<String> visitedRecords);
    private void processField(Schema.Field field, FieldPath fieldPath, boolean defaultNullable, List<SchemaField> fields, ...);
    private void processRecordField(...);
    private void processArrayField(...);
    private void processMapField(...);
    private void processUnionField(...);
    private void processEnumField(...);
    private void processPrimitiveField(...);
}

Import

import io.datahubproject.schematron.converters.avro.AvroSchemaConverter;

I/O Contract

Input             Type                     Description
schema            org.apache.avro.Schema   The Avro schema to convert
isKeySchema       boolean                  Whether this is a key schema (for Kafka)
defaultNullable   boolean                  Default nullability for fields
platformUrn       DataPlatformUrn          Target data platform URN
rawSchemaString   String                   Optional raw schema string for fingerprinting

Output            Type                     Description
SchemaMetadata    SchemaMetadata           DataHub schema with fields, hash, platform schema, and field paths following V2 spec

Usage Examples

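import com.linkedin.common.urn.DataPlatformUrn;
import com.linkedin.schema.SchemaMetadata;
import io.datahubproject.schematron.converters.avro.AvroSchemaConverter;
import org.apache.avro.Schema;

AvroSchemaConverter converter = AvroSchemaConverter.builder().build();

// avroSchemaJson is a String holding the Avro schema definition as JSON
Schema avroSchema = new Schema.Parser().parse(avroSchemaJson);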

SchemaMetadata metadata = converter.toDataHubSchema(
    avroSchema,
    false,               // not a key schema
    false,               // default not nullable
    new DataPlatformUrn("kafka"),
    avroSchemaJson       // raw schema for fingerprinting
);

// metadata.getFields() contains typed SchemaField entries
// metadata.getHash() contains the MD5 fingerprint
