Principle:Datahub project Datahub Spark Lineage JAR Setup

Metadata

Field	Value
principle_name	Spark Lineage JAR Setup
description	The practice of adding the DataHub Spark lineage agent JAR to a Spark application's classpath for automatic lineage capture
type	Principle
status	Active
last_updated	2026-02-10
domains	Data_Lineage, Apache_Spark, Metadata_Management
repository	datahub-project/datahub

Overview

Spark Lineage JAR Setup is the practice of adding the DataHub Spark lineage agent JAR to a Spark application's classpath so that lineage is captured automatically. The JAR provides the DatahubSparkListener and its dependencies (OpenLineage adapter, DataHub client, converters) as a single shadow JAR whose dependencies are relocated to avoid classpath collisions in the Spark environment.

Description

The DataHub Spark lineage agent is packaged as a shadow (uber) JAR that bundles all required dependencies into a single artifact. This packaging strategy is essential because Spark environments have their own complex classpaths containing libraries such as Jackson, Guava, Netty, and Apache Commons, which frequently conflict with the versions required by the lineage agent.

The shadow JAR approach relocates (shades) over 30 dependency namespaces into the io.acryl.shaded.* package prefix, ensuring that the agent's internal libraries do not collide with Spark's bundled libraries. This allows the agent to operate as a passive observer within the Spark JVM without disrupting the application's runtime behavior.

The JAR is built for multiple Scala versions (2.12 and 2.13) to match the Scala version used by the target Spark installation. The correct Scala suffix must be selected when adding the JAR to the classpath.
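As a minimal sketch of the suffix selection described above, the helper below derives a Maven coordinate from a Scala version string. The function name lineage_coordinate and the SUPPORTED_SCALA tuple are hypothetical names introduced for illustration. The group and artifact IDs follow the io.acryl:acryl-spark-lineage_2.12:VERSION pattern shown under Usage, and VERSION remains a placeholder.

```python
import re

# The two Scala builds named in this page; extend only if more are published.
SUPPORTED_SCALA = ("2.12", "2.13")


def lineage_coordinate(scala_version: str, agent_version: str) -> str:
    """Return a Maven coordinate for the agent matching a Scala version.

    scala_version may be a full version such as "2.12.18"; only the
    major.minor part selects the artifact suffix.
    """
    match = re.match(r"^(\d+\.\d+)", scala_version.strip())
    if not match or match.group(1) not in SUPPORTED_SCALA:
        raise ValueError(f"unsupported Scala version: {scala_version!r}")
    return f"io.acryl:acryl-spark-lineage_{match.group(1)}:{agent_version}"
```

The Scala version of an installation is printed by spark-submit --version. Failing fast on an unsupported version is a design choice of this sketch: a mismatched suffix would otherwise tend to surface only at runtime as class-loading or linkage errors.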

Key components bundled in the shadow JAR include:

DataHub Client -- REST, Kafka, File, and S3 emitter implementations for sending metadata to DataHub
OpenLineage Converter -- Converts OpenLineage RunEvent objects into DataHub Metadata Change Proposals (MCPs)
OpenLineage Spark Integration -- The upstream OpenLineage Spark listener that captures Spark lifecycle events
Typesafe Config -- Configuration parsing for HOCON-based settings under the spark.datahub.* namespace

Theoretical Basis

This principle is grounded in the agent pattern: the lineage JAR is a passive agent loaded into the Spark JVM that observes Spark events through a listener (DatahubSparkListener) without modifying application code. It is an agent in this loose sense only, not a java.lang.instrument agent attached with -javaagent. The shadow JAR packaging technique (also known as uber JAR or fat JAR) relocates dependencies to prevent version conflicts, following the well-established practice of dependency shading in Java ecosystems.

The relocation strategy uses the Gradle Shadow plugin (com.gradleup.shadow) to rewrite bytecode references from original package names to shaded equivalents at build time. For example, com.fasterxml.jackson is relocated to io.acryl.shaded.jackson, ensuring that the agent's Jackson version never conflicts with Spark's bundled Jackson.
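The effect of relocation on class names can be modelled in a few lines. This is a conceptual sketch only, not the Shadow plugin's implementation: the plugin rewrites bytecode, whereas the hypothetical relocate function below only maps names. The single rule in RELOCATIONS is the Jackson example given above; the real build defines many more.

```python
# Conceptual model of package relocation. Only the rule stated in the text
# is listed here; the actual build relocates over 30 namespaces.
RELOCATIONS = {"com.fasterxml.jackson": "io.acryl.shaded.jackson"}


def relocate(class_name: str, rules: dict = RELOCATIONS) -> str:
    """Map a fully qualified class name to its shaded equivalent.

    A rule applies only on a package boundary, so "com.fasterxml.jacksonx"
    is not treated as part of "com.fasterxml.jackson".
    """
    for original, shaded in rules.items():
        if class_name == original or class_name.startswith(original + "."):
            return shaded + class_name[len(original):]
    return class_name  # no rule matched: the name is left untouched
```

Under this model, the agent's copy of ObjectMapper lives under io.acryl.shaded.jackson while Spark's own copy keeps its original name, so both can be loaded in the same JVM without clashing.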

Usage

This principle applies when configuring a Spark application (launched through spark-submit, Databricks, Amazon EMR, or any other Spark environment) to automatically capture and emit data lineage to DataHub. The JAR can be supplied via:

--packages flag: --packages io.acryl:acryl-spark-lineage_2.12:VERSION (Maven Central resolution)
--jars flag: --jars /path/to/acryl-spark-lineage_2.12-VERSION.jar (local file path)
Classpath configuration in cluster management tools (Databricks Libraries, EMR Bootstrap Actions)

The Scala version suffix (_2.12 or _2.13) must match the Spark cluster's Scala version.
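The first two options can be sketched as a small helper that assembles a spark-submit argument list. Several names here are assumptions to verify against the DataHub documentation for your agent version: the listener class name datahub.spark.DatahubSparkListener, the spark.datahub.rest.server key, and the http://localhost:8080 address are illustrative, and spark_submit_args is a hypothetical function. The sketch also assumes the listener must be registered through Spark's spark.extraListeners setting, since placing a JAR on the classpath does not by itself activate a listener.

```python
import shlex

# Assumed fully qualified listener class; confirm for your agent version.
LISTENER_CLASS = "datahub.spark.DatahubSparkListener"


def spark_submit_args(app, coordinate=None, jar_path=None, conf=None):
    """Build a spark-submit argv that loads the lineage JAR.

    Exactly one of coordinate (for --packages) or jar_path (for --jars)
    must be given. Extra settings in conf are passed as --conf key=value.
    """
    if (coordinate is None) == (jar_path is None):
        raise ValueError("pass exactly one of coordinate or jar_path")
    args = ["spark-submit"]
    if coordinate is not None:
        args += ["--packages", coordinate]  # resolved from Maven Central
    else:
        args += ["--jars", jar_path]  # local file path
    settings = {"spark.extraListeners": LISTENER_CLASS, **(conf or {})}
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args + [app]


if __name__ == "__main__":
    argv = spark_submit_args(
        "job.py",
        coordinate="io.acryl:acryl-spark-lineage_2.12:VERSION",
        conf={"spark.datahub.rest.server": "http://localhost:8080"},
    )
    print(shlex.join(argv))  # copy-pasteable command line
```

Returning a list rather than a string keeps the arguments safe to hand to subprocess.run without shell quoting concerns; shlex.join is used only for display.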
