
Environment: DataHub Spark Lineage Environment

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Spark, Lineage
Last Updated 2026-02-10 00:00 GMT

Overview

Apache Spark 3.x environment with Scala 2.12 or 2.13, the acryl-spark-lineage shadow JAR, and a DataHub GMS or Kafka endpoint for automatic lineage capture.

Description

This environment defines the runtime prerequisites for the DataHub Spark Lineage agent. The agent is a Spark listener (implementing `SparkListener` and `StreamingQueryListener`) packaged as a shadow JAR. Since version 0.2.18, separate JARs are published for Scala 2.12 and 2.13 to match the Spark cluster's Scala version. The agent intercepts Spark job events, converts them to OpenLineage format, then emits DataHub MCPs (Metadata Change Proposals) to GMS via REST, Kafka, file, or S3 emitters.

Usage

Use this environment when deploying the DataHub Spark Lineage agent on any Spark 3.x cluster (standalone, YARN, Databricks, EMR, Glue). The agent must be added to the Spark classpath and registered as an extra listener via Spark configuration properties.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| Spark | Apache Spark 3.x | Spark 2.x not supported |
| Scala | 2.12 or 2.13 | Must match Spark cluster Scala version |
| Java | JDK 8+ (runtime) | JAR can target Java 8 bytecode via `-PjavaClassVersionDefault=8` |
| DataHub | GMS endpoint or Kafka cluster | For receiving lineage metadata |

Dependencies

JAR Dependencies

Since version 0.2.18, use Scala-version-specific artifacts:

  • Scala 2.12: `io.acryl:acryl-spark-lineage_2.12:{version}`
  • Scala 2.13: `io.acryl:acryl-spark-lineage_2.13:{version}`
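Picking the right artifact coordinate can be sketched as a small helper. This is a minimal illustration (the function name `lineage_artifact` is hypothetical, not part of the agent); only the coordinate format `io.acryl:acryl-spark-lineage_{scalaVersion}:{version}` comes from the published artifacts.

```python
def lineage_artifact(scala_version, agent_version):
    """Return the Maven coordinate for the acryl-spark-lineage shadow JAR.

    The artifact suffix must match the Scala version the Spark cluster was
    built against (2.12 or 2.13); a mismatch causes runtime class errors.
    """
    if scala_version not in ("2.12", "2.13"):
        raise ValueError("Unsupported Scala version: %s" % scala_version)
    return "io.acryl:acryl-spark-lineage_%s:%s" % (scala_version, agent_version)
```

The returned coordinate can be passed directly to `spark-submit --packages`.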

Spark Configuration Properties

Required Spark properties to register the listener:

  • `spark.extraListeners` = `datahub.spark.DatahubSparkListener`
  • `spark.datahub.rest.server` = GMS URL (e.g., `http://localhost:8080`)
  • `spark.datahub.rest.token` = GMS authentication token (optional)
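The three properties above can be assembled into `spark-submit` arguments programmatically. A minimal sketch (the helper name `datahub_conf_flags` is hypothetical; the property keys and listener class name are from the configuration above):

```python
def datahub_conf_flags(gms_url, token=None):
    """Build spark-submit --conf arguments that register the DataHub listener."""
    props = {
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        "spark.datahub.rest.server": gms_url,
    }
    if token:  # the token is optional; omit the property entirely when absent
        props["spark.datahub.rest.token"] = token
    # Flatten to ["--conf", "key=value", ...] for subprocess-style invocation
    return [arg for k, v in props.items() for arg in ("--conf", "%s=%s" % (k, v))]
```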

Build Dependencies (for building from source)

  • JDK 17 (build time)
  • Gradle 8.14+ (via wrapper)
  • OpenLineage 1.33.0

Credentials

The following Spark configuration properties provide authentication:

  • `spark.datahub.rest.server`: DataHub GMS REST endpoint URL
  • `spark.datahub.rest.token`: GMS authentication token
  • `spark.datahub.kafka.bootstrap`: Kafka bootstrap servers (for Kafka emitter)
  • `spark.datahub.kafka.schemaRegistryUrl`: Schema Registry URL (for Kafka emitter)
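Tokens and broker addresses are best pulled from the environment rather than hard-coded in job scripts. A hedged sketch (the environment variable names and the helper `emitter_properties` are assumptions for illustration; only the `spark.datahub.*` property keys come from the list above):

```python
import os

def emitter_properties(use_kafka=False):
    """Assemble emitter credential properties from environment variables
    so secrets never appear in the job script itself."""
    if use_kafka:
        return {
            "spark.datahub.kafka.bootstrap": os.environ["KAFKA_BOOTSTRAP"],
            "spark.datahub.kafka.schemaRegistryUrl": os.environ["SCHEMA_REGISTRY_URL"],
        }
    props = {"spark.datahub.rest.server": os.environ["DATAHUB_GMS_URL"]}
    token = os.environ.get("DATAHUB_GMS_TOKEN")  # optional
    if token:
        props["spark.datahub.rest.token"] = token
    return props
```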

Quick Install

# Using spark-submit with the shadow JAR
spark-submit \
  --packages io.acryl:acryl-spark-lineage_2.12:0.2.18 \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.rest.server=http://localhost:8080" \
  your_spark_app.py

# Building from source
./gradlew -PjavaClassVersionDefault=8 \
  :metadata-integration:java:acryl-spark-lineage:shadowJar

Code Evidence

Spark listener registration from `DatahubSparkListener.java:60-80`:

public class DatahubSparkListener extends SparkListener {
    // Registers as SparkListener and StreamingQueryListener
    // Intercepts onApplicationStart, onApplicationEnd, onJobEnd events
    // Converts to OpenLineage RunEvents and emits via DatahubEventEmitter
}

Scala version split in `build.gradle:15`:

// Separate JARs published for Scala 2.12 and 2.13
// Artifact naming: io.acryl:acryl-spark-lineage_{scalaVersion}:{version}

Configuration parsing from `SparkConfigParser.java:26-60`:

// All properties prefixed with spark.datahub.*
// Parses rest.server, rest.token, kafka.bootstrap, etc.
// Supports REST, Kafka, File, and S3 emitter types

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ClassNotFoundException: datahub.spark.DatahubSparkListener` | Shadow JAR not on classpath | Add JAR via `--packages` or `--jars` flag |
| Scala version mismatch | Using Scala 2.12 JAR on a 2.13 cluster | Match JAR Scala version to cluster: `_2.12` or `_2.13` |
| Empty pipeline with no tasks | Spark job failed before completion | Check Spark job logs; lineage is captured on successful events |
| Unreliable custom properties with concurrent apps | Multiple apps with the same appName | Use a unique `spark.app.name` per concurrent application |
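The Scala version mismatch can be caught before submission by inspecting the artifact suffix. A minimal pre-flight check (the function `check_scala_match` is a hypothetical helper, not part of the agent):

```python
import re

def check_scala_match(jar_coordinate, cluster_scala_version):
    """Return True if the artifact's _2.x suffix matches the cluster's
    Scala version; guards against the mismatch error described above."""
    m = re.search(r"acryl-spark-lineage_(\d+\.\d+):", jar_coordinate)
    if not m:
        raise ValueError("Coordinate lacks a Scala version suffix")
    return m.group(1) == cluster_scala_version
```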

Compatibility Notes

  • Databricks: Must set `spark.datahub.stage_metadata_coalescing=true` because `onApplicationEnd` is never called on Databricks clusters.
  • Databricks Standard/High-concurrency clusters: Not fully tested; use single-user clusters when possible.
  • AWS Glue: Can also enable `stage_metadata_coalescing` for coalesced runs.
  • Column-level lineage: Disable via `spark.datahub.captureColumnLevelLineage=false` for improved performance on large datasets.
  • MERGE INTO: Set `spark.datahub.metadata.dataset.enableEnhancedMergeIntoExtraction=true` for better table name extraction on Databricks.
  • Path-based datasets: Use `path_spec_list` to customize table name extraction from file paths.
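Putting the Databricks-specific notes together, a cluster Spark config might look like the fragment below. This is a hedged sketch: the GMS URL is a placeholder, and only the property names stated above are assumed to exist.

```properties
# Databricks cluster Spark config (example values; adjust the GMS URL)
spark.extraListeners datahub.spark.DatahubSparkListener
spark.datahub.rest.server http://localhost:8080
# onApplicationEnd is never called on Databricks, so coalesce stage metadata
spark.datahub.stage_metadata_coalescing true
# Improve MERGE INTO table name extraction on Databricks
spark.datahub.metadata.dataset.enableEnhancedMergeIntoExtraction true
```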
