Workflow:Datahub project Datahub Java SDK Metadata Emission

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Metadata_Management, Java_SDK
Last Updated 2026-02-09 17:00 GMT

Overview

End-to-end process for programmatically emitting metadata to DataHub from Java applications using the V1 SDK emitter library.

Description

This workflow covers the low-level Java SDK (V1) approach for sending metadata to DataHub. It uses the MetadataChangeProposalWrapper pattern to construct metadata events and one of four emitter backends (REST, Kafka, File, S3) to deliver them. This approach is suitable for CI/CD pipelines, custom orchestrators, and applications that need fine-grained control over metadata emission.

Usage

Execute this workflow when you need to emit metadata to DataHub from a Java application, build tool, or CI/CD pipeline. This is the appropriate choice when you require low-level control over MetadataChangeProposal construction, need to integrate with existing Java infrastructure, or want to emit metadata via Kafka rather than REST.

Execution Steps

Step 1: Add SDK Dependency

Add the datahub-client library to your project's build configuration. The SDK is published to Maven Central and supports both Gradle and Maven build systems.

Key considerations:

  • The dependency groupId is io.acryl and artifactId is datahub-client
  • Include the appropriate version matching your DataHub deployment
  • The SDK bundles dependencies for REST, Kafka, File, and S3 emitters
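The dependency coordinates above translate to a one-line Gradle declaration (Maven users declare the equivalent `<dependency>` element). The version string is a placeholder; substitute the release that matches your DataHub deployment:

```groovy
// build.gradle — pick the version matching your DataHub deployment
dependencies {
    implementation 'io.acryl:datahub-client:<version>'
}
```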

Step 2: Create an Emitter Instance

Instantiate the appropriate emitter based on your transport preference. The REST emitter sends metadata over HTTP to the DataHub GMS endpoint. The Kafka emitter publishes directly to Kafka topics. File and S3 emitters write metadata to local files or cloud storage.

Key considerations:

  • RestEmitter requires a GMS server URL and optional auth token
  • KafkaEmitter requires bootstrap server and schema registry URLs
  • FileEmitter writes JSON to a specified local file path
  • S3Emitter writes JSON to an S3 bucket with configurable key prefix
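As a sketch of REST emitter construction, assuming the `datahub.client.rest.RestEmitter` builder API from the datahub-client library (the server URL and token below are placeholders):

```java
import datahub.client.rest.RestEmitter;

// REST emitter pointed at the GMS endpoint; token is optional
RestEmitter emitter = RestEmitter.create(b -> b
    .server("http://localhost:8080")
    .token("<personal-access-token>"));
```

The Kafka emitter is constructed analogously from a config object carrying the bootstrap server and schema registry URLs; consult the SDK documentation for the exact builder names.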

Step 3: Construct Metadata Change Proposals

Build MetadataChangeProposalWrapper objects that describe the metadata you want to emit. Each wrapper contains an entity URN, an aspect name, and the aspect value. Use the provided builder utilities and Pegasus-generated aspect classes.

Key considerations:

  • URNs follow the format urn:li:entityType:(key components)
  • Aspect classes are generated from Avro/PDL schemas in metadata-models
  • Common aspects include DatasetProperties, SchemaMetadata, Ownership, and UpstreamLineage
  • The wrapper infers the aspect name from the aspect class type
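A minimal construction sketch, assuming the `MetadataChangeProposalWrapper` builder and the Pegasus-generated `DatasetProperties` aspect class (the dataset URN and description are illustrative):

```java
import com.linkedin.dataset.DatasetProperties;
import datahub.event.MetadataChangeProposalWrapper;

// The aspect name ("datasetProperties") is inferred from the aspect class type
MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
    .entityType("dataset")
    .entityUrn("urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)")
    .upsert()
    .aspect(new DatasetProperties().setDescription("Table of created users"))
    .build();
```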

Step 4: Emit Metadata Events

Send the constructed MCPs through the emitter. REST emission supports both blocking and non-blocking modes. Kafka emission is inherently asynchronous with callback support. Check the MetadataWriteResponse for success or error information.

Key considerations:

  • REST emission returns a Future that resolves to MetadataWriteResponse; call get() to block, or handle the result asynchronously
  • Kafka mode uses a Callback interface for async acknowledgement
  • Always call emitter.flush() for Kafka to ensure all pending events are delivered
  • Handle exceptions for network failures and server errors
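A blocking REST emission sketch, assuming the `emit(mcpw, callback)` signature and `MetadataWriteResponse` accessors from the datahub-client library (`emitter` and `mcpw` come from the previous steps):

```java
import java.util.concurrent.Future;
import datahub.client.MetadataWriteResponse;

// Passing null for the callback; call get() to block until the server responds
Future<MetadataWriteResponse> future = emitter.emit(mcpw, null);
MetadataWriteResponse response = future.get();
if (!response.isSuccess()) {
    System.err.println("Emission failed: " + response.getResponseContent());
}
```

For Kafka, pass a Callback implementation instead of blocking on the Future, since delivery is acknowledged asynchronously.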

Step 5: Close the Emitter

Close the emitter to release resources and ensure all pending metadata events are flushed. This is especially important for the Kafka and File emitters, which buffer events internally.

Key considerations:

  • Use try-with-resources or explicit close() calls
  • For Kafka, flush() before close() to ensure delivery
  • FileEmitter finalizes the JSON output on close
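Assuming the emitter implements Closeable (as the SDK documentation suggests), try-with-resources keeps cleanup automatic; the flush-before-close pattern for Kafka is shown in comments:

```java
import datahub.client.rest.RestEmitter;

// try-with-resources closes the emitter even if emit() throws
try (RestEmitter emitter = RestEmitter.create(b -> b.server("http://localhost:8080"))) {
    emitter.emit(mcpw, null).get();
}
// Kafka variant: call kafkaEmitter.flush() before close() to guarantee delivery
```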

Execution Diagram

GitHub URL

Workflow Repository