Implementation:Datahub project Datahub Spark Submit Agent JAR
| Attribute | Value |
|---|---|
| Page Type | Implementation (External Tool Doc) |
| Workflow | Spark_Lineage_Capture |
| Pair | 1 of 6 |
| Principle | Principle:Datahub_project_Datahub_Agent_Deployment |
| Repository | https://github.com/datahub-project/datahub |
| Source Location | metadata-integration/java/acryl-spark-lineage/README.md:L1-478 |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
The Spark Submit Agent JAR is the deployment artifact for the DataHub Spark lineage agent. It is a shadow (uber) JAR produced by the acryl-spark-lineage module that bundles the DatahubSparkListener, the OpenLineage Spark integration, the DataHub Java client emitters, and all transitive dependencies into a single relocatable archive. This JAR is deployed to the Spark driver classpath via spark-submit --packages, spark-submit --jars, or Spark configuration properties.
The artifact is published to Maven Central under the coordinates io.acryl:acryl-spark-lineage_2.12 (for Scala 2.12) and io.acryl:acryl-spark-lineage_2.13 (for Scala 2.13). Versioning follows the main DataHub repository's semantic versioning scheme.
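The coordinate must match the Scala binary version the Spark distribution was built against; a mismatch typically surfaces as classloading errors at runtime. As a minimal sketch (the helper function below is illustrative, not part of the agent), the coordinate can be derived from a full Scala version string:

```python
def agent_coordinate(scala_version: str, agent_version: str = "0.2.18") -> str:
    """Map a full Scala version (e.g. "2.12.15") to the Maven coordinate
    of the matching agent artifact. Hypothetical helper for illustration."""
    # Reduce "major.minor.patch" to the binary version "major.minor"
    binary = ".".join(scala_version.split(".")[:2])
    if binary not in ("2.12", "2.13"):
        raise ValueError(f"no published agent artifact for Scala {binary}")
    return f"io.acryl:acryl-spark-lineage_{binary}:{agent_version}"

print(agent_coordinate("2.12.15"))  # io.acryl:acryl-spark-lineage_2.12:0.2.18
```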
Usage
The agent JAR can be deployed through several mechanisms depending on the Spark environment:
spark-submit command line:
```shell
spark-submit \
  --packages io.acryl:acryl-spark-lineage_2.12:0.2.18 \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.rest.server=http://localhost:8080" \
  my_spark_job.py
```
Spark configuration file (spark-defaults.conf):
```
spark.jars.packages        io.acryl:acryl-spark-lineage_2.12:0.2.18
spark.extraListeners       datahub.spark.DatahubSparkListener
spark.datahub.rest.server  http://localhost:8080
```
Python notebook (PySpark):
```python
spark = SparkSession.builder \
    .master("spark://spark-master:7077") \
    .appName("test-application") \
    .config("spark.jars.packages", "io.acryl:acryl-spark-lineage_2.12:0.2.18") \
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \
    .config("spark.datahub.rest.server", "http://localhost:8080") \
    .enableHiveSupport() \
    .getOrCreate()
```
Java application:
```java
SparkSession spark = SparkSession.builder()
    .appName("test-application")
    .config("spark.master", "spark://spark-master:7077")
    .config("spark.jars.packages", "io.acryl:acryl-spark-lineage_2.12:0.2.18")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "http://localhost:8080")
    .enableHiveSupport()
    .getOrCreate();
```
Databricks: The JAR is uploaded to DBFS and loaded via an init script:
```shell
#!/bin/bash
cp /dbfs/datahub/datahub-spark-lineage*.jar /databricks/jars
```
Amazon EMR: The agent is configured through the spark-defaults configuration properties as documented in the EMR release guide.
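A sketch of what the corresponding EMR configurations JSON might look like, using the spark-defaults classification (the GMS endpoint is a placeholder; consult the EMR release guide for the exact submission mechanics):

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.jars.packages": "io.acryl:acryl-spark-lineage_2.12:0.2.18",
      "spark.extraListeners": "datahub.spark.DatahubSparkListener",
      "spark.datahub.rest.server": "http://datahub-gms.internal:8080"
    }
  }
]
```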
Code Reference
Source Location
| Attribute | Value |
|---|---|
| File | metadata-integration/java/acryl-spark-lineage/README.md |
| Lines | L1-478 |
| Module | acryl-spark-lineage |
| Maven Coordinates | io.acryl:acryl-spark-lineage_2.12:&lt;version&gt; or io.acryl:acryl-spark-lineage_2.13:&lt;version&gt; |
Signature
The agent is not invoked through an API; it is deployed as a JAR artifact. The primary entry point is the listener class registration:
```
spark.extraListeners=datahub.spark.DatahubSparkListener
```
Alternatively, the JAR can be provided directly:
```shell
spark-submit --jars /path/to/acryl-spark-lineage.jar ...
spark-submit --packages io.acryl:acryl-spark-lineage_2.12:<version> ...
```
Import
No explicit import is required by the user: Spark's classloader instantiates the listener class named in spark.extraListeners. The fully qualified class name is:
```java
import datahub.spark.DatahubSparkListener;
```
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | Spark Configuration | spark.jars.packages or spark.jars property specifying the agent JAR coordinates or path. |
| Input | Spark Configuration | spark.extraListeners=datahub.spark.DatahubSparkListener to register the listener. |
| Input | Spark Configuration | spark.datahub.* properties controlling emitter behavior, transport, and metadata options. |
| Output | Side Effect | The agent registers a SparkListener that intercepts lifecycle events and emits lineage MCPs to the configured transport (REST, Kafka, File, S3). |
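The input side of this contract can be sketched as a small helper that assembles the required properties before building a session; the function name and defaults below are illustrative, not part of the agent:

```python
def datahub_spark_conf(gms_url, token=None,
                       coordinate="io.acryl:acryl-spark-lineage_2.12:0.2.18"):
    """Build the minimal Spark properties that activate the agent with the
    REST emitter. Hypothetical helper for illustration only."""
    conf = {
        "spark.jars.packages": coordinate,
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        "spark.datahub.rest.server": gms_url,
    }
    if token:
        # Optional bearer token for an auth-enabled GMS endpoint
        conf["spark.datahub.rest.token"] = token
    return conf

# Usage: apply each property while constructing a SparkSession, e.g.
# builder = SparkSession.builder.appName("job")
# for key, value in datahub_spark_conf("http://localhost:8080").items():
#     builder = builder.config(key, value)
```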
Usage Examples
Example 1: Minimal spark-submit with REST emission
```shell
spark-submit \
  --packages io.acryl:acryl-spark-lineage_2.12:0.2.18 \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.rest.server=https://datahub.mycompany.com/gms" \
  --conf "spark.datahub.rest.token=my_auth_token" \
  etl_pipeline.py
```
Example 2: Scala 2.13 with Kafka emission
```shell
spark-submit \
  --packages io.acryl:acryl-spark-lineage_2.13:0.2.18 \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.emitter=kafka" \
  --conf "spark.datahub.kafka.bootstrap=kafka-broker:9092" \
  --conf "spark.datahub.kafka.schema_registry_url=http://schema-registry:8081" \
  etl_pipeline.py
```
Example 3: Databricks cluster configuration
```
spark.extraListeners                    datahub.spark.DatahubSparkListener
spark.datahub.rest.server               https://datahub.mycompany.com/gms
spark.datahub.rest.token                {{secrets/datahub/rest-token}}
spark.datahub.stage_metadata_coalescing true
spark.datahub.databricks.cluster        my-cluster-name
```