Implementation:Datahub project Datahub Spark Lineage JAR Dependency

From Leeroopedia


Metadata

Field Value
implementation_name Spark Lineage JAR Dependency
type External Tool Doc
status Active
last_updated 2026-02-10
source_file metadata-integration/java/acryl-spark-lineage/build.gradle
lines L24-211
repository datahub-project/datahub
domains Data_Lineage, Apache_Spark, Metadata_Management

Overview

This implementation defines the build configuration for the DataHub Spark lineage agent shadow JAR. The build.gradle file specifies all dependencies, Scala version variants, shadow JAR relocations, and publishing configuration for the acryl-spark-lineage artifact.

Description

The build configuration produces shadow JARs for Scala 2.12 and 2.13, each containing the DataHub Spark lineage agent and all its transitive dependencies (except those provided by Spark itself). The build uses the Gradle Shadow plugin to relocate over 30 dependency packages into the io.acryl.shaded.* namespace.

Source Code Reference

File: metadata-integration/java/acryl-spark-lineage/build.gradle

Dependencies (L24-90)

dependencies {
  constraints {
    provided(externalDependency.hadoopClient) {
      because 'Needed for tie breaking of guava version need for spark and wiremock'
    }
    provided(externalDependency.hadoopCommon3) {
      because 'required for org.apache.hadoop.util.StopWatch'
    }
    provided(externalDependency.commonsIo) {
      because 'required for org.apache.commons.io.Charsets that is used internally'
    }
  }

  provided(externalDependency.sparkSql)
  provided(externalDependency.sparkHive)
  implementation 'org.slf4j:slf4j-log4j12:2.0.7'
  implementation externalDependency.httpClient
  implementation externalDependency.typesafeConfig
  implementation externalDependency.slf4jApi
  compileOnly externalDependency.lombok
  annotationProcessor externalDependency.lombok
  implementation externalDependency.json
  implementation project(':metadata-integration:java:openlineage-converter')
  implementation project(path: ':metadata-integration:java:datahub-client')

  // Default to Scala 2.12 for main compilation
  implementation "io.openlineage:openlineage-spark_2.12:$openLineageVersion"
  compileOnly "org.apache.iceberg:iceberg-spark3-runtime:0.12.1"
  compileOnly("org.apache.spark:spark-sql_2.12:3.3.4") { /* jetty exclusions */ }
}

Scala-Versioned Shadow JAR Tasks (L115-222)

Each Scala version gets a dedicated shadow JAR task that creates a detached configuration with the correct OpenLineage Spark dependency:

scalaVersions.each { sv ->
  tasks.register("shadowJar_${sv.replace('.', '_')}",
      com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar) {
    zip64 = true
    archiveClassifier = ''
    archiveBaseName = "acryl-spark-lineage_${sv}"
    mergeServiceFiles()

    // Scala-specific OpenLineage dependency, added to the detached
    // configuration (scalaConfig; its creation is elided from this excerpt)
    scalaConfig.dependencies.add(
      project.dependencies.create(
        "io.openlineage:openlineage-spark_${sv}:${openLineageVersion}"))

    configurations = [scalaConfig]
    // ... exclusions and relocations
  }
}

Shadow JAR Relocations (L178-211)

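The relocations below, an excerpt of the full list, move bundled dependencies into the io.acryl.shaded.* namespace so that they cannot clash with other versions of the same libraries on the Spark classpath: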

relocate 'com.fasterxml.jackson', 'io.acryl.shaded.jackson'
relocate 'com.google.common', 'io.acryl.shaded.com.google.common'
relocate 'org.apache.hc', 'io.acryl.shaded.http'
relocate 'org.apache.commons.codec', 'io.acryl.shaded.org.apache.commons.codec'
relocate 'org.apache.commons.compress', 'io.acryl.shaded.org.apache.commons.compress'
relocate 'org.apache.commons.lang3', 'io.acryl.shaded.org.apache.commons.lang3'
relocate 'com.typesafe', 'io.acryl.shaded.com.typesafe'
relocate 'io.netty', 'io.acryl.shaded.io.netty'
relocate 'org.springframework', 'io.acryl.shaded.org.springframework'
relocate 'org.yaml', 'io.acryl.shaded.org.yaml'
relocate 'com.github.benmanes.caffeine', 'io.acryl.shaded.com.github.benmanes.caffeine'
relocate 'org.checkerframework', 'io.acryl.shaded.org.checkerframework'
relocate 'com.google.errorprone', 'io.acryl.shaded.com.google.errorprone'
relocate 'javax.annotation', 'io.acryl.shaded.javax.annotation'
relocate 'org.reflections', 'io.acryl.shaded.org.reflections'
relocate 'org.json', 'io.acryl.shaded.org.json'
relocate 'com.github', 'io.acryl.shaded.com.github'
relocate 'io.opentracing', 'io.acryl.shaded.io.opentracing'
relocate 'ch.qos', 'io.acryl.shaded.ch.qos'
relocate 'javassist', 'io.acryl.shaded.javassist'
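
To make the effect of these rules concrete, the following is a minimal Python sketch of prefix-based relocation. It is not the Shadow plugin's implementation: the RELOCATIONS list is an illustrative subset of the rules above, relocate_class is a hypothetical helper, and the "first matching rule wins" behaviour is an assumption made for the sketch.

```python
# Illustrative subset of the relocation rules listed above, in the same order.
RELOCATIONS = [
    ("com.fasterxml.jackson", "io.acryl.shaded.jackson"),
    ("com.google.common", "io.acryl.shaded.com.google.common"),
    ("org.apache.hc", "io.acryl.shaded.http"),
    ("com.github.benmanes.caffeine", "io.acryl.shaded.com.github.benmanes.caffeine"),
    ("com.github", "io.acryl.shaded.com.github"),
]


def relocate_class(class_name, rules=RELOCATIONS):
    """Return the relocated class name, or the original if no rule matches.

    Assumption: the first rule whose pattern is a prefix of the class name
    wins. Only prefix rewriting is modelled, not include/exclude filters.
    """
    for pattern, shaded in rules:
        if class_name.startswith(pattern):
            # Swap the matched prefix for its shaded counterpart.
            return shaded + class_name[len(pattern):]
    return class_name


if __name__ == "__main__":
    for name in (
        "com.fasterxml.jackson.databind.ObjectMapper",
        "org.apache.hc.client5.http.classic.HttpClient",
        "org.apache.spark.sql.SparkSession",
    ):
        print(name, "->", relocate_class(name))
```

Classes outside the listed prefixes, such as Spark's own, pass through unchanged, which is why the agent can still call the Spark runtime directly.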

I/O Contract

Aspect Details
Input Java source code, Gradle dependency declarations, Scala version parameter (2.12 or 2.13)
Output Shadow JAR: acryl-spark-lineage_2.12-VERSION.jar or acryl-spark-lineage_2.13-VERSION.jar
Build Command ./gradlew :metadata-integration:java:acryl-spark-lineage:shadowJar
Published Artifact io.acryl:acryl-spark-lineage_2.12:VERSION or io.acryl:acryl-spark-lineage_2.13:VERSION
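
The naming scheme in the table and in the task registration can be summarised in a few lines of Python. This is only a sketch for reasoning about the outputs: the helper names are hypothetical, and the set of supported Scala versions is taken from the 2.12 and 2.13 variants mentioned on this page.

```python
SUPPORTED_SCALA_VERSIONS = ("2.12", "2.13")  # variants named on this page


def _check(scala_version):
    if scala_version not in SUPPORTED_SCALA_VERSIONS:
        raise ValueError(f"unsupported Scala version: {scala_version}")


def shadow_task_name(scala_version):
    """Mirror "shadowJar_${sv.replace('.', '_')}" from the build script."""
    _check(scala_version)
    return "shadowJar_" + scala_version.replace(".", "_")


def jar_file_name(scala_version, version):
    """File name of the shadow JAR, following the Output row of the table."""
    _check(scala_version)
    return f"acryl-spark-lineage_{scala_version}-{version}.jar"


def maven_coordinate(scala_version, version):
    """Coordinate of the published artifact, following the table."""
    _check(scala_version)
    return f"io.acryl:acryl-spark-lineage_{scala_version}:{version}"


if __name__ == "__main__":
    print(shadow_task_name("2.13"))
    print(jar_file_name("2.13", "VERSION"))
    print(maven_coordinate("2.13", "VERSION"))
```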

Usage Examples

Spark Submit with Maven Packages

spark-submit \
  --packages io.acryl:acryl-spark-lineage_2.12:0.13.1 \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.rest.server=http://localhost:8080" \
  my_spark_app.py

Spark Submit with Local JAR

spark-submit \
  --jars /path/to/acryl-spark-lineage_2.12-0.13.1.jar \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.rest.server=http://localhost:8080" \
  my_spark_app.py

Scala 2.13 Variant

spark-submit \
  --packages io.acryl:acryl-spark-lineage_2.13:0.13.1 \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.rest.server=http://localhost:8080" \
  my_spark_app.py
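
PySpark Session Configuration (sketch)

When the session is created from Python rather than configured on the command line, the same settings can be passed as Spark properties. The sketch below assumes PySpark is installed and that no SparkSession is already running, since spark.jars.packages (the property form of --packages) only takes effect when the JVM is launched. The helper names datahub_lineage_conf and build_session are hypothetical.

```python
def datahub_lineage_conf(coordinate, server):
    """Spark properties equivalent to the spark-submit flags shown above."""
    return {
        # Property form of --packages.
        "spark.jars.packages": coordinate,
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        "spark.datahub.rest.server": server,
    }


def build_session(app_name, conf):
    """Create a SparkSession with the given properties applied."""
    # Imported lazily so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    conf = datahub_lineage_conf(
        "io.acryl:acryl-spark-lineage_2.12:0.13.1",
        "http://localhost:8080",
    )
    spark = build_session("lineage-demo", conf)
    spark.range(3).show()
    spark.stop()
```

The Scala suffix in the coordinate should match the Scala version of the Spark distribution in use, as in the 2.12 and 2.13 variants above.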

Key Configuration Notes

  • The provided configuration excludes Spark SQL, Spark Hive, Hadoop, and Commons IO from the shadow JAR since they are supplied by the Spark runtime
  • SLF4J and Logback are excluded from the shadow JAR via exclude(dependency("org.slf4j::")) and exclude(dependency("ch.qos.logback:")), so the agent logs through whatever binding the Spark runtime provides
  • Native ZStandard libraries are excluded (exclude '**/libzstd-jni.*')
  • The zip64 = true flag enables the ZIP64 format, which is required once the JAR exceeds 65,535 entries
  • Service files are merged (mergeServiceFiles()) to preserve SPI registrations
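
One way to sanity-check these notes against a built artifact is to inspect the JAR's entries. The following is a rough sketch using Python's standard zipfile module, not part of the DataHub build: the prefix list is an illustrative subset of the relocation rules, and both helper names are hypothetical.

```python
import zipfile

# Un-relocated package paths that should no longer appear in the shadow JAR
# (illustrative subset of the relocation list).
UNSHADED_PREFIXES = ("com/fasterxml/jackson/", "com/google/common/", "io/netty/")
SHADED_ROOT = "io/acryl/shaded/"


def find_unshaded_classes(jar, prefixes=UNSHADED_PREFIXES):
    """Return class entries that still sit under an un-relocated package path.

    `jar` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(jar) as archive:
        return [
            name
            for name in archive.namelist()
            if name.endswith(".class") and name.startswith(prefixes)
        ]


def count_shaded_classes(jar):
    """Count class entries that live under the io.acryl.shaded namespace."""
    with zipfile.ZipFile(jar) as archive:
        return sum(
            1
            for name in archive.namelist()
            if name.startswith(SHADED_ROOT) and name.endswith(".class")
        )


if __name__ == "__main__":
    import sys

    path = sys.argv[1]  # e.g. /path/to/acryl-spark-lineage_2.12-VERSION.jar
    leftovers = find_unshaded_classes(path)
    print(f"{count_shaded_classes(path)} shaded classes, {len(leftovers)} un-relocated")
```

An empty result from find_unshaded_classes suggests the listed packages were relocated as expected; it does not cover prefixes outside the illustrative subset.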
