Environment:Datahub project Datahub Spark Lineage Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Spark, Lineage |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Apache Spark 3.x environment with Scala 2.12 or 2.13, the acryl-spark-lineage shadow JAR, and a DataHub GMS or Kafka endpoint for automatic lineage capture.
Description
This environment defines the runtime prerequisites for the DataHub Spark Lineage agent. The agent is a Spark listener (implementing `SparkListener` and `StreamingQueryListener`) packaged as a shadow JAR. Since version 0.2.18, separate JARs are published for Scala 2.12 and 2.13 to match the Spark cluster's Scala version. The agent intercepts Spark job events, converts them to OpenLineage format, then emits DataHub MCPs (Metadata Change Proposals) to GMS via REST, Kafka, file, or S3 emitters.
Usage
Use this environment when deploying the DataHub Spark Lineage agent on any Spark 3.x cluster (standalone, YARN, Databricks, EMR, Glue). The agent must be added to the Spark classpath and registered as an extra listener via Spark configuration properties.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Spark | Apache Spark 3.x | Spark 2.x not supported |
| Scala | 2.12 or 2.13 | Must match Spark cluster Scala version |
| Java | JDK 8+ (runtime) | JAR can target Java 8 bytecode via `-PjavaClassVersionDefault=8` |
| DataHub | GMS endpoint or Kafka cluster | For receiving lineage metadata |
Dependencies
JAR Dependencies
Since version 0.2.18, use Scala-version-specific artifacts:
- Scala 2.12: `io.acryl:acryl-spark-lineage_2.12:{version}`
- Scala 2.13: `io.acryl:acryl-spark-lineage_2.13:{version}`
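The Scala-version suffix must match the cluster. A minimal sketch of picking the right coordinate (the helper function is hypothetical, not part of the agent; only the artifact naming scheme comes from the list above):

```python
def lineage_artifact(scala_version: str, agent_version: str = "0.2.18") -> str:
    """Return the Maven coordinate for the acryl-spark-lineage shadow JAR.

    Hypothetical helper for illustration: the coordinate format follows the
    naming scheme documented above.
    """
    if scala_version not in ("2.12", "2.13"):
        raise ValueError(f"unsupported Scala version: {scala_version}")
    return f"io.acryl:acryl-spark-lineage_{scala_version}:{agent_version}"

print(lineage_artifact("2.12"))  # io.acryl:acryl-spark-lineage_2.12:0.2.18
```

The returned coordinate can be passed directly to `spark-submit --packages`.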
Spark Configuration Properties
Required Spark properties to register the listener:
- `spark.extraListeners` = `datahub.spark.DatahubSparkListener`
- `spark.datahub.rest.server` = GMS URL (e.g., `http://localhost:8080`)
- `spark.datahub.rest.token` = GMS authentication token (optional)
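The properties above can be supplied either through `SparkSession.builder.config(...)` or as `--conf` flags. A plain-Python sketch (no pyspark required) of rendering them as submit flags; the URL and token values are placeholders:

```python
# Required and optional properties, mirroring the list above.
required_conf = {
    "spark.extraListeners": "datahub.spark.DatahubSparkListener",
    "spark.datahub.rest.server": "http://localhost:8080",  # example GMS URL
}
optional_conf = {
    "spark.datahub.rest.token": "<gms-token>",  # only if GMS requires auth
}

def to_submit_flags(conf: dict) -> list:
    """Render properties as spark-submit --conf flags (illustrative only)."""
    return [f'--conf "{k}={v}"' for k, v in conf.items()]

for flag in to_submit_flags(required_conf):
    print(flag)
```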
Build Dependencies (for building from source)
- JDK 17 (build time)
- Gradle 8.14+ (via wrapper)
- OpenLineage 1.33.0
Credentials
The following Spark configuration properties configure the emitter connection and authentication:
- `spark.datahub.rest.server`: DataHub GMS REST endpoint URL
- `spark.datahub.rest.token`: GMS authentication token
- `spark.datahub.kafka.bootstrap`: Kafka bootstrap servers (for Kafka emitter)
- `spark.datahub.kafka.schemaRegistryUrl`: Schema Registry URL (for Kafka emitter)
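When the Kafka emitter is used instead of REST, only the Kafka properties are needed. A sketch of such a configuration, using the property names listed above; the broker and registry addresses are placeholders:

```python
# Kafka-emitter configuration sketch; values are placeholders.
kafka_conf = {
    "spark.extraListeners": "datahub.spark.DatahubSparkListener",
    "spark.datahub.kafka.bootstrap": "broker-1:9092,broker-2:9092",
    "spark.datahub.kafka.schemaRegistryUrl": "http://schema-registry:8081",
}

# With the Kafka emitter, the REST properties are omitted entirely.
assert not any(k.startswith("spark.datahub.rest.") for k in kafka_conf)
```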
Quick Install
```bash
# Using spark-submit with the shadow JAR
spark-submit \
  --packages io.acryl:acryl-spark-lineage_2.12:0.2.18 \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.rest.server=http://localhost:8080" \
  your_spark_app.py

# Building from source
./gradlew -PjavaClassVersionDefault=8 \
  :metadata-integration:java:acryl-spark-lineage:shadowJar
```
Code Evidence
Spark listener registration from `DatahubSparkListener.java:60-80`:
```java
public class DatahubSparkListener extends SparkListener {
    // Registers as SparkListener and StreamingQueryListener
    // Intercepts onApplicationStart, onApplicationEnd, onJobEnd events
    // Converts to OpenLineage RunEvents and emits via DatahubEventEmitter
}
```
Scala version split in `build.gradle:15`:
```groovy
// Separate JARs published for Scala 2.12 and 2.13
// Artifact naming: io.acryl:acryl-spark-lineage_{scalaVersion}:{version}
```
Configuration parsing from `SparkConfigParser.java:26-60`:
```java
// All properties prefixed with spark.datahub.*
// Parses rest.server, rest.token, kafka.bootstrap, etc.
// Supports REST, Kafka, File, and S3 emitter types
```
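The prefix-stripping behavior of `SparkConfigParser` can be sketched in a few lines. This is a loose Python approximation for illustration, not the actual Java implementation:

```python
PREFIX = "spark.datahub."

def extract_datahub_conf(spark_conf: dict) -> dict:
    """Keep only spark.datahub.* properties, with the prefix stripped.

    Loosely mimics SparkConfigParser; a sketch, not the real parser.
    """
    return {k[len(PREFIX):]: v
            for k, v in spark_conf.items()
            if k.startswith(PREFIX)}

conf = {
    "spark.app.name": "etl-job",
    "spark.datahub.rest.server": "http://localhost:8080",
    "spark.datahub.rest.token": "abc",
}
print(extract_datahub_conf(conf))
# {'rest.server': 'http://localhost:8080', 'rest.token': 'abc'}
```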
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ClassNotFoundException: datahub.spark.DatahubSparkListener` | Shadow JAR not on classpath | Add JAR via `--packages` or `--jars` flag |
| Scala version mismatch | Using Scala 2.12 JAR on 2.13 cluster | Match JAR Scala version to cluster: `_2.12` or `_2.13` |
| Empty pipeline with no tasks | Spark job failed before completion | Check Spark job logs; lineage captured on successful events |
| Unreliable custom properties with concurrent apps | Multiple apps with same appName | Use unique `spark.app.name` per concurrent application |
Compatibility Notes
- Databricks: Must set `spark.datahub.stage_metadata_coalescing=true` because `onApplicationEnd` is never called on Databricks clusters.
- Databricks Standard/High-concurrency clusters: Not fully tested; use single-user clusters when possible.
- AWS Glue: `spark.datahub.stage_metadata_coalescing` can also be enabled so stage metadata is coalesced into a single run.
- Column-level lineage: Disable via `spark.datahub.captureColumnLevelLineage=false` for improved performance on large datasets.
- MERGE INTO: Set `spark.datahub.metadata.dataset.enableEnhancedMergeIntoExtraction=true` for better table name extraction on Databricks.
- Path-based datasets: Use `path_spec_list` to customize table name extraction from file paths.
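The platform-specific notes above can be condensed into a small decision sketch. The selection logic is illustrative only (it is not part of the agent); the property names and the Databricks/Glue behavior come from the notes:

```python
def platform_conf(platform: str) -> dict:
    """Assemble platform-specific properties per the compatibility notes.

    Hypothetical helper: the if/else logic is illustrative, but the
    property names are those documented above.
    """
    conf = {"spark.extraListeners": "datahub.spark.DatahubSparkListener"}
    if platform in ("databricks", "glue"):
        # onApplicationEnd never fires on Databricks, so coalesce per stage.
        conf["spark.datahub.stage_metadata_coalescing"] = "true"
    if platform == "databricks":
        # Better table name extraction for MERGE INTO on Databricks.
        conf["spark.datahub.metadata.dataset.enableEnhancedMergeIntoExtraction"] = "true"
    return conf

print(platform_conf("databricks"))
```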
Related Pages
- Implementation:Datahub_project_Datahub_Spark_Lineage_JAR_Dependency
- Implementation:Datahub_project_Datahub_DatahubSparkListener_Init
- Implementation:Datahub_project_Datahub_SparkConfigParser_ParseSparkConfig
- Implementation:Datahub_project_Datahub_SparkLineageConf_Builder
- Implementation:Datahub_project_Datahub_DatahubEventEmitter_Emit