Principle:Datahub project Datahub SDK Dependency Management

From Leeroopedia


Property        Value
Principle Name  SDK_Dependency_Management
Category        Java_SDK_Metadata_Emission
Workflow        Java_SDK_Metadata_Emission
Repository      https://github.com/datahub-project/datahub
Last Updated    2026-02-09 17:00 GMT

Overview

Description

SDK Dependency Management is the principle of correctly declaring and resolving library dependencies in Java build systems (Maven and Gradle) so that the DataHub metadata emission SDK and all of its transitive dependencies are available on the classpath at compile time and runtime. The DataHub Java SDK is published under the io.acryl group and consists of multiple coordinated artifacts that must be resolved together: datahub-client (emitter implementations), datahub-event (event wrappers and formatters), and metadata-models (generated Avro/PDL model classes).

Proper dependency management ensures that version alignment across these artifacts is maintained, that transitive dependencies such as Apache HttpClient, Kafka client libraries, and Jackson serialization are pulled in automatically, and that there are no classpath conflicts with the host application.
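One quick way to confirm that the SDK actually reached the runtime classpath is to probe for one of its classes. The sketch below is illustrative only: the helper name ClasspathCheck is hypothetical, and the class name datahub.client.rest.RestEmitter is an assumption about the SDK's REST emitter entry point that should be checked against the SDK version in use.

```java
// Minimal sketch: probe whether a class is loadable from the current classpath.
final class ClasspathCheck {

    // Returns true if the named class is visible to this class's loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: locate the class without running its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies failed to link.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class name; adjust to match the SDK version in use.
        String emitter = "datahub.client.rest.RestEmitter";
        System.out.println(emitter + " on classpath: " + isOnClasspath(emitter));
    }
}
```

A check like this can run in a smoke test, so that a missing or excluded dependency fails fast instead of surfacing as a NoClassDefFoundError during the first emit.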

Usage

This principle applies whenever a Java or JVM-based application needs to emit metadata to DataHub programmatically. Typical scenarios include:

  • CI/CD pipelines that publish dataset or pipeline metadata after successful builds.
  • Custom orchestrators that emit lineage and execution metadata from internal scheduling systems.
  • Spark or Flink jobs that report dataset-level schema and lineage information.
  • Microservices that register themselves and their data assets on startup.

The developer declares a single top-level dependency on io.acryl:datahub-client in their build file, and the build system resolves all transitive artifacts automatically.

Theoretical Basis

SDK Dependency Management draws on the Dependency Management pattern as formalized by Maven and Gradle build systems. The key theoretical concepts are:

Transitive Dependency Resolution -- When a project declares a dependency on artifact A, and artifact A itself depends on artifacts B and C, the build system automatically includes B and C in the dependency graph. The DataHub SDK leverages this property: declaring datahub-client transitively brings in datahub-event, metadata-models, Apache HttpClient, Jackson, and other required libraries.
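The resolution step described above amounts to a graph traversal. The following is a minimal sketch, not how Maven or Gradle are implemented: the class TransitiveResolver is hypothetical, the graph uses short artifact names without versions, and the exact edges (which artifact depends on which) are assumed for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TransitiveResolver {

    // Breadth-first walk that collects every artifact reachable from root.
    static Set<String> resolve(String root, Map<String, List<String>> directDeps) {
        Set<String> resolved = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String dep : directDeps.getOrDefault(current, List.of())) {
                // add() returns false for artifacts already seen, so each is visited once.
                if (resolved.add(dep)) {
                    queue.add(dep);
                }
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Hypothetical edges; the real graph comes from the published POM files.
        Map<String, List<String>> graph = Map.of(
            "my-app", List.of("datahub-client"),
            "datahub-client", List.of("datahub-event", "httpclient", "jackson"),
            "datahub-event", List.of("metadata-models", "jackson"));
        // prints [datahub-client, datahub-event, httpclient, jackson, metadata-models]
        System.out.println(resolve("my-app", graph));
    }
}
```

The application declares only datahub-client, yet the resolved set contains every artifact reachable through it, which is the property the SDK relies on.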

Semantic Versioning -- The SDK follows semantic versioning conventions where the major version signals breaking changes, the minor version indicates new features, and the patch version covers bug fixes. All DataHub SDK artifacts share the same version number to ensure compatibility.
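To make the major/minor/patch distinction concrete, here is a small sketch that parses three-part versions and classifies an upgrade. The SemVer class is hypothetical and handles only plain MAJOR.MINOR.PATCH strings; it does not handle pre-release suffixes or extra version segments, which real releases may use.

```java
final class SemVer implements Comparable<SemVer> {
    final int major;
    final int minor;
    final int patch;

    SemVer(int major, int minor, int patch) {
        this.major = major;
        this.minor = minor;
        this.patch = patch;
    }

    // Parses "MAJOR.MINOR.PATCH"; rejects anything with a different number of parts.
    static SemVer parse(String s) {
        String[] parts = s.split("\\.", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected MAJOR.MINOR.PATCH: " + s);
        }
        return new SemVer(Integer.parseInt(parts[0]),
                          Integer.parseInt(parts[1]),
                          Integer.parseInt(parts[2]));
    }

    // Orders numerically by major, then minor, then patch.
    @Override
    public int compareTo(SemVer o) {
        if (major != o.major) return Integer.compare(major, o.major);
        if (minor != o.minor) return Integer.compare(minor, o.minor);
        return Integer.compare(patch, o.patch);
    }

    // Names the most significant component that differs between two versions.
    static String changeType(SemVer from, SemVer to) {
        if (from.major != to.major) return "major";
        if (from.minor != to.minor) return "minor";
        if (from.patch != to.patch) return "patch";
        return "none";
    }

    public static void main(String[] args) {
        SemVer current = parse("1.2.3");
        SemVer candidate = parse("1.10.0");
        // Numeric comparison: 1.10.0 is newer than 1.2.3 even though "10" < "2" as text.
        System.out.println(candidate.compareTo(current) > 0);
        System.out.println(changeType(current, candidate));
    }
}
```

Comparing components as integers rather than strings matters here: a lexical comparison would wrongly rank 1.10.0 below 1.2.3.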

Bill of Materials (BOM) Pattern -- A BOM is a build descriptor that pins a set of mutually compatible versions for related artifacts, so that consumers do not have to specify each version individually. The DataHub SDK pursues the same goal, version alignment across artifacts from one project, through coordinated releases: all of its artifacts (client, event, models) are published under the same version to prevent mismatched combinations.
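The alignment requirement is easy to verify once the resolved versions are known. The sketch below assumes the resolved artifact-to-version mapping has already been obtained by some other means (for example, from a dependency report); the VersionAlignment class and the version strings are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class VersionAlignment {

    // Collects the distinct versions found across the given artifacts.
    static Set<String> distinctVersions(Map<String, String> resolvedVersions) {
        return new TreeSet<>(resolvedVersions.values());
    }

    // Aligned means every artifact resolved to the same version.
    static boolean isAligned(Map<String, String> resolvedVersions) {
        return distinctVersions(resolvedVersions).size() <= 1;
    }

    public static void main(String[] args) {
        // Placeholder versions for illustration, not real release numbers.
        Map<String, String> resolved = Map.of(
            "io.acryl:datahub-client", "1.0.0",
            "io.acryl:datahub-event", "1.0.0",
            "io.acryl:metadata-models", "0.9.0");
        if (!isAligned(resolved)) {
            System.out.println("Mismatched SDK versions: " + distinctVersions(resolved));
        }
    }
}
```

A mismatch like the one in the example typically arises when another dependency pulls in an older copy of one SDK artifact, so reporting the distinct versions points directly at the artifact to pin or exclude.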

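Dependency Scope -- Build systems distinguish between compile-time, runtime, and test dependencies. The SDK dependency is typically declared with the default compile scope (Maven) or the implementation configuration (Gradle), making it available at both compile time and runtime. In Gradle, implementation also keeps the SDK off the compile classpath of downstream consumers of the module; the api configuration (from the java-library plugin) is the alternative when SDK types appear in the module's own public API.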

Conflict Resolution -- When the host application and the SDK depend on different versions of a shared library (e.g., Jackson), the build system applies conflict resolution strategies such as nearest-wins (Maven) or highest-version-wins (Gradle). Developers may need to use dependency exclusions or forced versions to resolve conflicts.
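The two default strategies can give different answers for the same dependency graph. The sketch below models both in a simplified form; it is not the build tools' actual implementation. The ConflictResolution class is hypothetical, versions are assumed to be purely numeric and dot-separated, and the Jackson versions are illustrative.

```java
import java.util.List;

final class ConflictResolution {

    // One requested version of a library, with its depth in the dependency tree
    // (1 = declared directly by the application).
    static final class Candidate {
        final String version;
        final int depth;

        Candidate(String version, int depth) {
            this.version = version;
            this.depth = depth;
        }
    }

    // Maven-style: the shallowest declaration wins; ties go to the first one listed.
    static String nearestWins(List<Candidate> candidates) {
        Candidate best = candidates.get(0);
        for (Candidate c : candidates) {
            if (c.depth < best.depth) {
                best = c;
            }
        }
        return best.version;
    }

    // Gradle-style default: the highest requested version wins, regardless of depth.
    static String highestWins(List<Candidate> candidates) {
        String best = candidates.get(0).version;
        for (Candidate c : candidates) {
            if (compareVersions(c.version, best) > 0) {
                best = c.version;
            }
        }
        return best;
    }

    // Numeric comparison of dot-separated versions; missing parts count as 0.
    static int compareVersions(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Illustrative: the application declares Jackson directly (depth 1),
        // while the SDK requests a newer version transitively (depth 2).
        List<Candidate> jackson = List.of(
            new Candidate("2.15.2", 1),
            new Candidate("2.17.0", 2));
        System.out.println("nearest-wins: " + nearestWins(jackson)); // prints nearest-wins: 2.15.2
        System.out.println("highest-wins: " + highestWins(jackson)); // prints highest-wins: 2.17.0
    }
}
```

Under nearest-wins the application's older version silently replaces the one the SDK was built against, which is the case where an exclusion or a forced version is most often needed.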
