Implementation:Datahub project Datahub Spark Lineage JAR Dependency
Metadata
| Field | Value |
|---|---|
| implementation_name | Spark Lineage JAR Dependency |
| type | External Tool Doc |
| status | Active |
| last_updated | 2026-02-10 |
| source_file | metadata-integration/java/acryl-spark-lineage/build.gradle |
| lines | L24-211 |
| repository | datahub-project/datahub |
| domains | Data_Lineage, Apache_Spark, Metadata_Management |
Overview
This implementation defines the build configuration for the DataHub Spark lineage agent shadow JAR. The build.gradle file specifies all dependencies, Scala version variants, shadow JAR relocations, and publishing configuration for the acryl-spark-lineage artifact.
Description
The build configuration produces shadow JARs for Scala 2.12 and 2.13, each containing the DataHub Spark lineage agent and all its transitive dependencies (except those provided by Spark itself). The build uses the Gradle Shadow plugin to relocate over 30 dependency packages into the io.acryl.shaded.* namespace.
Source Code Reference
File: metadata-integration/java/acryl-spark-lineage/build.gradle
Dependencies (L24-90)
dependencies {
    constraints {
        provided(externalDependency.hadoopClient) {
            because 'Needed for tie breaking of guava version need for spark and wiremock'
        }
        provided(externalDependency.hadoopCommon3) {
            because 'required for org.apache.hadoop.util.StopWatch'
        }
        provided(externalDependency.commonsIo) {
            because 'required for org.apache.commons.io.Charsets that is used internally'
        }
    }

    // Supplied by the Spark runtime, so kept out of the shadow JAR
    provided(externalDependency.sparkSql)
    provided(externalDependency.sparkHive)

    implementation 'org.slf4j:slf4j-log4j12:2.0.7'
    implementation externalDependency.httpClient
    implementation externalDependency.typesafeConfig
    implementation externalDependency.slf4jApi
    compileOnly externalDependency.lombok
    annotationProcessor externalDependency.lombok
    implementation externalDependency.json
    implementation project(':metadata-integration:java:openlineage-converter')
    implementation project(path: ':metadata-integration:java:datahub-client')

    // Default to Scala 2.12 for main compilation
    implementation "io.openlineage:openlineage-spark_2.12:$openLineageVersion"
    compileOnly "org.apache.iceberg:iceberg-spark3-runtime:0.12.1"
    compileOnly("org.apache.spark:spark-sql_2.12:3.3.4") { /* jetty exclusions */ }
}
Scala-Versioned Shadow JAR Tasks (L115-222)
Each Scala version gets a dedicated shadow JAR task that creates a detached configuration with the correct OpenLineage Spark dependency:
scalaVersions.each { sv ->
    tasks.register("shadowJar_${sv.replace('.', '_')}",
            com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar) {
        zip64 = true
        archiveClassifier = ''
        archiveBaseName = "acryl-spark-lineage_${sv}"
        mergeServiceFiles()

        // Scala-specific OpenLineage dependency
        scalaConfig.dependencies.add(
            project.dependencies.create(
                "io.openlineage:openlineage-spark_${sv}:${openLineageVersion}"))
        configurations = [scalaConfig]
        // ... exclusions and relocations
    }
}
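Because each task name is derived from the Scala version by replacing dots with underscores, the per-version Gradle invocation can be worked out mechanically. The following is a minimal shell sketch, not part of the build itself. The function names shadow_task_name and shadow_build_command are hypothetical, and it assumes the per-version tasks are invoked under the same project path as the build command in the I/O Contract. The script only prints the command and does not run Gradle.

```shell
#!/bin/sh
# Hypothetical helper: mirrors "shadowJar_${sv.replace('.', '_')}" from build.gradle.
shadow_task_name() {
  printf 'shadowJar_%s\n' "$(printf '%s' "$1" | tr '.' '_')"
}

# Print (do not run) the Gradle command for one Scala version.
# Assumption: the task is addressed through the acryl-spark-lineage project path.
shadow_build_command() {
  printf './gradlew :metadata-integration:java:acryl-spark-lineage:%s\n' \
    "$(shadow_task_name "$1")"
}
```

For example, shadow_build_command 2.13 prints a command ending in shadowJar_2_13.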
Shadow JAR Relocations (L178-211)
The following packages are relocated to avoid classpath conflicts:
relocate 'com.fasterxml.jackson', 'io.acryl.shaded.jackson'
relocate 'com.google.common', 'io.acryl.shaded.com.google.common'
relocate 'org.apache.hc', 'io.acryl.shaded.http'
relocate 'org.apache.commons.codec', 'io.acryl.shaded.org.apache.commons.codec'
relocate 'org.apache.commons.compress', 'io.acryl.shaded.org.apache.commons.compress'
relocate 'org.apache.commons.lang3', 'io.acryl.shaded.org.apache.commons.lang3'
relocate 'com.typesafe', 'io.acryl.shaded.com.typesafe'
relocate 'io.netty', 'io.acryl.shaded.io.netty'
relocate 'org.springframework', 'io.acryl.shaded.org.springframework'
relocate 'org.yaml', 'io.acryl.shaded.org.yaml'
relocate 'com.github.benmanes.caffeine', 'io.acryl.shaded.com.github.benmanes.caffeine'
relocate 'org.checkerframework', 'io.acryl.shaded.org.checkerframework'
relocate 'com.google.errorprone', 'io.acryl.shaded.com.google.errorprone'
relocate 'javax.annotation', 'io.acryl.shaded.javax.annotation'
relocate 'org.reflections', 'io.acryl.shaded.org.reflections'
relocate 'org.json', 'io.acryl.shaded.org.json'
relocate 'com.github', 'io.acryl.shaded.com.github'
relocate 'io.opentracing', 'io.acryl.shaded.io.opentracing'
relocate 'ch.qos', 'io.acryl.shaded.ch.qos'
relocate 'javassist', 'io.acryl.shaded.javassist'
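One way to spot-check the relocations is to list the JAR's entries and look for classes that still sit under an original package path. The following is a minimal shell sketch, assuming a JAR listing (for example from jar tf) is piped on stdin. The function name find_unrelocated is hypothetical, and only a handful of the prefixes above are checked. A non-empty result is a prompt to investigate, not proof of a misconfiguration.

```shell
#!/bin/sh
# Hypothetical check: read JAR entry names on stdin and print any entry that
# still lives under an original (unrelocated) package path.
# Only a sample of the relocated prefixes is covered here.
find_unrelocated() {
  grep -E '^(com/fasterxml/jackson|com/google/common|org/apache/hc|io/netty|org/json)/' || true
}
```

A possible invocation is jar tf /path/to/acryl-spark-lineage_2.12-0.13.1.jar | find_unrelocated, which prints nothing when every sampled package has been moved under io/acryl/shaded/.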
I/O Contract
| Aspect | Details |
|---|---|
| Input | Java source code, Gradle dependency declarations, Scala version parameter (2.12 or 2.13) |
| Output | Shadow JAR: acryl-spark-lineage_2.12-VERSION.jar or acryl-spark-lineage_2.13-VERSION.jar |
| Build Command | ./gradlew :metadata-integration:java:acryl-spark-lineage:shadowJar |
| Published Artifact | io.acryl:acryl-spark-lineage_2.12:VERSION or io.acryl:acryl-spark-lineage_2.13:VERSION |
Usage Examples
Spark Submit with Maven Packages
spark-submit \
--packages io.acryl:acryl-spark-lineage_2.12:0.13.1 \
--conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
--conf "spark.datahub.rest.server=http://localhost:8080" \
my_spark_app.py
Spark Submit with Local JAR
spark-submit \
--jars /path/to/acryl-spark-lineage_2.12-0.13.1.jar \
--conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
--conf "spark.datahub.rest.server=http://localhost:8080" \
my_spark_app.py
Scala 2.13 Variant
spark-submit \
--packages io.acryl:acryl-spark-lineage_2.13:0.13.1 \
--conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
--conf "spark.datahub.rest.server=http://localhost:8080" \
my_spark_app.py
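The three invocations above differ only in the Scala suffix and in how the JAR is supplied. The following is a minimal shell sketch of a wrapper that assembles the --packages form and rejects Scala versions the build does not produce. The function name lineage_submit_args is hypothetical, and it prints arguments instead of launching Spark.

```shell
#!/bin/sh
# Hypothetical wrapper: print spark-submit arguments, one per line, for a
# given Scala version, agent version, and DataHub REST endpoint.
lineage_submit_args() {
  scala_version=$1
  agent_version=$2
  rest_server=$3
  # Only the two Scala variants built by build.gradle are accepted.
  case "$scala_version" in
    2.12|2.13) ;;
    *) echo "unsupported Scala version: $scala_version" >&2; return 1 ;;
  esac
  printf '%s\n' \
    --packages "io.acryl:acryl-spark-lineage_${scala_version}:${agent_version}" \
    --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
    --conf "spark.datahub.rest.server=${rest_server}"
}
```

Since none of the values contain spaces, the output can be spliced into a command line, for example spark-submit $(lineage_submit_args 2.12 0.13.1 http://localhost:8080) my_spark_app.py.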
Key Configuration Notes
- The provided configuration excludes Spark SQL, Spark Hive, Hadoop, and Commons IO from the shadow JAR since they are supplied by the Spark runtime
- SLF4J and Logback are excluded from the shadow JAR (exclude(dependency("org.slf4j::")) and exclude(dependency("ch.qos.logback:")))
- Native ZStandard libraries are excluded (exclude '**/libzstd-jni.*')
- The zip64 = true flag enables support for large JARs exceeding 65535 entries
- Service files are merged (mergeServiceFiles()) to preserve SPI registrations
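To see whether a given JAR actually crosses the entry limit that zip64 addresses, its entries can be counted. The following is a minimal shell sketch, assuming a JAR listing is piped on stdin with one entry per line. The function name needs_zip64 is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: read JAR entry names on stdin and succeed (exit 0)
# when the entry count exceeds the classic ZIP limit of 65535 entries.
needs_zip64() {
  count=$(wc -l | tr -d ' ')
  [ "$count" -gt 65535 ]
}
```

A possible invocation is jar tf /path/to/acryl-spark-lineage_2.12-0.13.1.jar | needs_zip64 && echo "zip64 required".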