
Environment: DataHub Spark Lineage Environment

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Spark, Lineage
Last Updated 2026-02-10 00:00 GMT

Overview

Apache Spark 3.x environment with Scala 2.12 or 2.13, the acryl-spark-lineage shadow JAR, and a DataHub GMS or Kafka endpoint for automatic lineage capture.

Description

This environment defines the runtime prerequisites for the DataHub Spark Lineage agent. The agent is a Spark listener (implementing `SparkListener` and `StreamingQueryListener`) packaged as a shadow JAR. Since version 0.2.18, separate JARs are published for Scala 2.12 and 2.13 to match the Spark cluster's Scala version. The agent intercepts Spark job events, converts them to OpenLineage format, then emits DataHub MCPs (Metadata Change Proposals) to GMS via REST, Kafka, file, or S3 emitters.

Usage

Use this environment when deploying the DataHub Spark Lineage agent on any Spark 3.x cluster (standalone, YARN, Databricks, EMR, Glue). The agent must be added to the Spark classpath and registered as an extra listener via Spark configuration properties.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| Spark | Apache Spark 3.x | Spark 2.x not supported |
| Scala | 2.12 or 2.13 | Must match Spark cluster Scala version |
| Java | JDK 8+ (runtime) | JAR can target Java 8 bytecode via `-PjavaClassVersionDefault=8` |
| DataHub | GMS endpoint or Kafka cluster | For receiving lineage metadata |

Dependencies

JAR Dependencies

Since version 0.2.18, use Scala-version-specific artifacts:

  • Scala 2.12: `io.acryl:acryl-spark-lineage_2.12:{version}`
  • Scala 2.13: `io.acryl:acryl-spark-lineage_2.13:{version}`
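Picking the right artifact coordinate can be sketched as a small helper. This is a minimal illustration (the function name `lineage_artifact` is hypothetical, not part of the agent); only the coordinate format `io.acryl:acryl-spark-lineage_{scalaVersion}:{version}` comes from the published artifacts.

```python
def lineage_artifact(scala_version, agent_version):
    """Return the Maven coordinate for the acryl-spark-lineage shadow JAR.

    The artifact suffix must match the Scala version the Spark cluster was
    built against (2.12 or 2.13); a mismatch causes runtime class errors.
    """
    if scala_version not in ("2.12", "2.13"):
        raise ValueError("Unsupported Scala version: %s" % scala_version)
    return "io.acryl:acryl-spark-lineage_%s:%s" % (scala_version, agent_version)
```

The returned coordinate can be passed directly to `spark-submit --packages`.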

Spark Configuration Properties

Required Spark properties to register the listener:

  • `spark.extraListeners` = `datahub.spark.DatahubSparkListener`
  • `spark.datahub.rest.server` = GMS URL (e.g., `http://localhost:8080`)
  • `spark.datahub.rest.token` = GMS authentication token (optional)
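The three properties above can be assembled into `spark-submit` arguments programmatically. A minimal sketch (the helper name `datahub_conf_flags` is hypothetical; the property keys and listener class name are from the configuration above):

```python
def datahub_conf_flags(gms_url, token=None):
    """Build spark-submit --conf arguments that register the DataHub listener."""
    props = {
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        "spark.datahub.rest.server": gms_url,
    }
    if token:  # the token is optional; omit the property entirely when absent
        props["spark.datahub.rest.token"] = token
    # Flatten to ["--conf", "key=value", ...] for subprocess-style invocation
    return [arg for k, v in props.items() for arg in ("--conf", "%s=%s" % (k, v))]
```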

Build Dependencies (for building from source)

  • JDK 17 (build time)
  • Gradle 8.14+ (via wrapper)
  • OpenLineage 1.33.0

Credentials

The following Spark configuration properties provide authentication:

  • `spark.datahub.rest.server`: DataHub GMS REST endpoint URL
  • `spark.datahub.rest.token`: GMS authentication token
  • `spark.datahub.kafka.bootstrap`: Kafka bootstrap servers (for Kafka emitter)
  • `spark.datahub.kafka.schemaRegistryUrl`: Schema Registry URL (for Kafka emitter)
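Tokens and broker addresses are best pulled from the environment rather than hard-coded in job scripts. A hedged sketch (the environment variable names and the helper `emitter_properties` are assumptions for illustration; only the `spark.datahub.*` property keys come from the list above):

```python
import os

def emitter_properties(use_kafka=False):
    """Assemble emitter credential properties from environment variables
    so secrets never appear in the job script itself."""
    if use_kafka:
        return {
            "spark.datahub.kafka.bootstrap": os.environ["KAFKA_BOOTSTRAP"],
            "spark.datahub.kafka.schemaRegistryUrl": os.environ["SCHEMA_REGISTRY_URL"],
        }
    props = {"spark.datahub.rest.server": os.environ["DATAHUB_GMS_URL"]}
    token = os.environ.get("DATAHUB_GMS_TOKEN")  # optional
    if token:
        props["spark.datahub.rest.token"] = token
    return props
```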

Quick Install

# Using spark-submit with the shadow JAR
spark-submit \
  --packages io.acryl:acryl-spark-lineage_2.12:0.2.18 \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.rest.server=http://localhost:8080" \
  your_spark_app.py

# Building from source
./gradlew -PjavaClassVersionDefault=8 \
  :metadata-integration:java:acryl-spark-lineage:shadowJar

Code Evidence

Spark listener registration from `DatahubSparkListener.java:60-80`:

public class DatahubSparkListener extends SparkListener {
    // Registers as SparkListener and StreamingQueryListener
    // Intercepts onApplicationStart, onApplicationEnd, onJobEnd events
    // Converts to OpenLineage RunEvents and emits via DatahubEventEmitter
}

Scala version split in `build.gradle:15`:

// Separate JARs published for Scala 2.12 and 2.13
// Artifact naming: io.acryl:acryl-spark-lineage_{scalaVersion}:{version}

Configuration parsing from `SparkConfigParser.java:26-60`:

// All properties prefixed with spark.datahub.*
// Parses rest.server, rest.token, kafka.bootstrap, etc.
// Supports REST, Kafka, File, and S3 emitter types

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ClassNotFoundException: datahub.spark.DatahubSparkListener` | Shadow JAR not on classpath | Add JAR via `--packages` or `--jars` flag |
| Scala version mismatch | Using Scala 2.12 JAR on a 2.13 cluster | Match JAR Scala version to cluster: `_2.12` or `_2.13` |
| Empty pipeline with no tasks | Spark job failed before completion | Check Spark job logs; lineage is captured on successful events |
| Unreliable custom properties with concurrent apps | Multiple apps with the same appName | Use a unique `spark.app.name` per concurrent application |
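The Scala version mismatch can be caught before submission by inspecting the artifact suffix. A minimal pre-flight check (the function `check_scala_match` is a hypothetical helper, not part of the agent):

```python
import re

def check_scala_match(jar_coordinate, cluster_scala_version):
    """Return True if the artifact's _2.x suffix matches the cluster's
    Scala version; guards against the mismatch error described above."""
    m = re.search(r"acryl-spark-lineage_(\d+\.\d+):", jar_coordinate)
    if not m:
        raise ValueError("Coordinate lacks a Scala version suffix")
    return m.group(1) == cluster_scala_version
```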

Compatibility Notes

  • Databricks: Must set `spark.datahub.stage_metadata_coalescing=true` because `onApplicationEnd` is never called on Databricks clusters.
  • Databricks Standard/High-concurrency clusters: Not fully tested; use single-user clusters when possible.
  • AWS Glue: Can also enable `stage_metadata_coalescing` for coalesced runs.
  • Column-level lineage: Disable via `spark.datahub.captureColumnLevelLineage=false` for improved performance on large datasets.
  • MERGE INTO: Set `spark.datahub.metadata.dataset.enableEnhancedMergeIntoExtraction=true` for better table name extraction on Databricks.
  • Path-based datasets: Use `path_spec_list` to customize table name extraction from file paths.
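Putting the Databricks-specific notes together, a cluster Spark config might look like the fragment below. This is a hedged sketch: the GMS URL is a placeholder, and only the property names stated above are assumed to exist.

```properties
# Databricks cluster Spark config (example values; adjust the GMS URL)
spark.extraListeners datahub.spark.DatahubSparkListener
spark.datahub.rest.server http://localhost:8080
# onApplicationEnd is never called on Databricks, so coalesce stage metadata
spark.datahub.stage_metadata_coalescing true
# Improve MERGE INTO table name extraction on Databricks
spark.datahub.metadata.dataset.enableEnhancedMergeIntoExtraction true
```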
