Principle:Datahub project Datahub Spark Lineage JAR Setup
Metadata
| Field | Value |
|---|---|
| principle_name | Spark Lineage JAR Setup |
| description | The practice of adding the DataHub Spark lineage agent JAR to a Spark application's classpath for automatic lineage capture |
| type | Principle |
| status | Active |
| last_updated | 2026-02-10 |
| domains | Data_Lineage, Apache_Spark, Metadata_Management |
| repository | datahub-project/datahub |
Overview
Spark Lineage JAR Setup is the practice of adding the DataHub Spark lineage agent JAR to a Spark application's classpath so that lineage is captured automatically. The JAR provides the DatahubSparkListener and its dependencies (OpenLineage adapter, DataHub client, converters) in a single shadow JAR whose dependencies are relocated to avoid classpath collisions in the Spark environment.
Description
The DataHub Spark lineage agent is packaged as a shadow (uber) JAR that bundles all required dependencies into a single artifact. This packaging strategy is essential because Spark environments have their own complex classpaths containing libraries such as Jackson, Guava, Netty, and Apache Commons, which frequently conflict with the versions required by the lineage agent.
The shadow JAR approach relocates (shades) over 30 dependency namespaces into the io.acryl.shaded.* package prefix, ensuring that the agent's internal libraries do not collide with Spark's bundled libraries. This allows the agent to operate as a passive observer within the Spark JVM without disrupting the application's runtime behavior.
The JAR is built for multiple Scala versions (2.12 and 2.13) to match the Scala version used by the target Spark installation. The correct Scala suffix must be selected when adding the JAR to the classpath.
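As a minimal sketch of the suffix selection described above, the helper below derives the Maven coordinate from a Scala version string. The function name `lineage_coordinate` and the constant `SUPPORTED_SCALA_VERSIONS` are hypothetical names introduced for illustration; the agent version is left as a caller-supplied placeholder.

```python
# Scala binary versions the JAR is built for, per the text above.
SUPPORTED_SCALA_VERSIONS = ("2.12", "2.13")


def lineage_coordinate(scala_version: str, agent_version: str) -> str:
    """Return the Maven coordinate of the agent JAR for a given Scala version.

    Hypothetical helper: accepts a full version such as "2.12.18" and
    reduces it to the binary version ("2.12") used as the artifact suffix.
    """
    binary = ".".join(scala_version.split(".")[:2])
    if binary not in SUPPORTED_SCALA_VERSIONS:
        raise ValueError(f"unsupported Scala version: {scala_version}")
    return f"io.acryl:acryl-spark-lineage_{binary}:{agent_version}"


# "VERSION" is a placeholder for the release actually being deployed.
print(lineage_coordinate("2.12.18", "VERSION"))
```

Failing fast on an unsupported Scala version surfaces a mismatch before the job is submitted, rather than as a class-loading error at runtime.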
Key components bundled in the shadow JAR include:
- DataHub Client -- REST, Kafka, File, and S3 emitter implementations for sending metadata to DataHub
- OpenLineage Converter -- Converts OpenLineage RunEvent objects into DataHub Metadata Change Proposals (MCPs)
- OpenLineage Spark Integration -- The upstream OpenLineage Spark listener that captures Spark lifecycle events
- Typesafe Config -- Configuration parsing for HOCON-based settings under the spark.datahub.* namespace
Theoretical Basis
This principle is grounded in an agent-style listener pattern: the lineage JAR is loaded into the Spark JVM, where the DatahubSparkListener passively observes Spark events without modifying application code. The shadow JAR packaging technique (also known as uber JAR or fat JAR) relocates dependencies to prevent version conflicts, following the well-established practice of dependency shading in Java ecosystems.
The relocation strategy uses the Gradle Shadow plugin (com.gradleup.shadow) to rewrite bytecode references from original package names to shaded equivalents at build time. For example, com.fasterxml.jackson is relocated to io.acryl.shaded.jackson, ensuring that the agent's Jackson version never conflicts with Spark's bundled Jackson.
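To illustrate only the name mapping that relocation produces, the sketch below applies a package-prefix rule to fully qualified class names. It is not how the Shadow plugin works internally (the plugin rewrites bytecode references at build time); `relocate` and `RELOCATIONS` are hypothetical names, and the single rule is the Jackson example from the text.

```python
from typing import Mapping

# One relocation rule, taken from the Jackson example in the text.
# A real build declares many more rules.
RELOCATIONS = {"com.fasterxml.jackson": "io.acryl.shaded.jackson"}


def relocate(class_name: str, rules: Mapping[str, str] = RELOCATIONS) -> str:
    """Return the shaded name for a class, or the name unchanged if no rule matches."""
    for original, shaded in rules.items():
        # Match on a package boundary so "com.fasterxml.jacksonx" is left alone.
        if class_name == original or class_name.startswith(original + "."):
            return shaded + class_name[len(original):]
    return class_name


print(relocate("com.fasterxml.jackson.databind.ObjectMapper"))
print(relocate("org.apache.spark.sql.SparkSession"))
```

Because the agent's copy of a library lives under a different package name, both copies can be loaded in the same JVM without one shadowing the other.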
Usage
This principle applies when configuring a Spark application (Spark Submit, Databricks, Amazon EMR, or any Spark environment) to automatically capture and emit data lineage to DataHub. The JAR can be supplied via:
- --packages flag: --packages io.acryl:acryl-spark-lineage_2.12:VERSION (Maven Central resolution)
- --jars flag: --jars /path/to/acryl-spark-lineage_2.12-VERSION.jar (local file path)
- Classpath configuration in cluster management tools (Databricks Libraries, EMR Bootstrap Actions)
The Scala version suffix (_2.12 or _2.13) must match the Spark cluster's Scala version.
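The two command-line options can be sketched as a small helper that assembles a spark-submit argument list. Several details here are assumptions not stated in this page: the fully qualified listener class name `datahub.spark.DatahubSparkListener`, the `spark.datahub.rest.server` key, the default server URL, and the application file `my_job.py`. The helper name `spark_submit_command` is hypothetical.

```python
import shlex
from typing import List, Optional

# Assumption: fully qualified name of the listener class registered with Spark.
LISTENER_CLASS = "datahub.spark.DatahubSparkListener"


def spark_submit_command(
    app: str,
    *,
    coordinate: Optional[str] = None,
    jar_path: Optional[str] = None,
    datahub_server: str = "http://localhost:8080",  # assumed DataHub endpoint
) -> List[str]:
    """Build a spark-submit argument list that loads the lineage JAR.

    Exactly one of `coordinate` (for --packages) or `jar_path` (for --jars)
    must be given.
    """
    if (coordinate is None) == (jar_path is None):
        raise ValueError("pass exactly one of coordinate or jar_path")
    cmd = ["spark-submit"]
    if coordinate is not None:
        cmd += ["--packages", coordinate]  # resolved from Maven Central
    else:
        cmd += ["--jars", jar_path]  # local file path
    cmd += [
        "--conf", f"spark.extraListeners={LISTENER_CLASS}",
        # Assumed key under the spark.datahub.* namespace.
        "--conf", f"spark.datahub.rest.server={datahub_server}",
        app,
    ]
    return cmd


cmd = spark_submit_command(
    "my_job.py",  # hypothetical application
    coordinate="io.acryl:acryl-spark-lineage_2.12:VERSION",
)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps each argument intact if the command is later passed to `subprocess.run`.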
Knowledge Sources
- DataHub GitHub Repository
- OpenLineage Documentation
- Apache Spark Configuration Documentation
- Gradle Shadow Plugin Documentation
Related
- Implemented by: Datahub_project_Datahub_Spark_Lineage_JAR_Dependency