Workflow:Datahub project Datahub Spark Lineage Capture

Knowledge Sources	DataHub Spark Integration
Domains	Data_Engineering, Lineage, Spark, Big_Data
Last Updated	2026-02-09 12:00 GMT

Overview

End-to-end process for automatically capturing Apache Spark job lineage and emitting it to DataHub in real time using a Java listener agent.

Description

This workflow covers the Spark lineage integration, which hooks into Spark's event system via a custom listener to automatically capture data lineage. When a Spark job runs, the listener extracts input/output datasets from the query plan, creates DataFlow (pipeline) and DataJob (task) entities, and emits lineage relationships to DataHub. The integration supports REST, Kafka, and file-based emission, and works with Spark SQL, Structured Streaming, and Databricks. Column-level lineage with transformation type tracking is also supported.
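To make the entity model concrete, the sketch below models a flow, a job, and its lineage edges as plain Python objects. The URN layouts follow DataHub's general dataFlow and dataJob URN patterns; the flow name, cluster value, job id, and dataset names are hypothetical, and the listener's actual naming scheme may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def data_flow_urn(orchestrator: str, flow_id: str, cluster: str) -> str:
    """Build the URN for a pipeline (DataFlow)."""
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"


@dataclass
class DataJob:
    """A task inside a flow, with the datasets it reads and writes."""
    flow_urn: str
    job_id: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)

    @property
    def urn(self) -> str:
        return f"urn:li:dataJob:({self.flow_urn},{self.job_id})"

    def lineage_edges(self) -> List[Tuple[str, str]]:
        """Treat every input dataset as upstream of every output dataset."""
        return [(src, dst) for src in self.inputs for dst in self.outputs]


# Hypothetical values, for illustration only.
flow_urn = data_flow_urn("spark", "nightly_orders", "default")
job = DataJob(
    flow_urn=flow_urn,
    job_id="query_1",
    inputs=["urn:li:dataset:(urn:li:dataPlatform:s3,raw/orders,PROD)"],
    outputs=["urn:li:dataset:(urn:li:dataPlatform:s3,curated/orders,PROD)"],
)

print(job.urn)
for upstream, downstream in job.lineage_edges():
    print(upstream, "->", downstream)
```

The point of the sketch is the containment: one flow per Spark application, one job per query, and dataset-to-dataset edges derived from each job's inputs and outputs.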

Usage

Execute this workflow when you run Apache Spark jobs and want to automatically capture which datasets are read and written, producing end-to-end data lineage in DataHub. This is suitable for Spark running on standalone clusters, EMR, Databricks, or any Spark-compatible environment.

Execution Steps

Step 1: Add Lineage JAR Dependency

Include the acryl-spark-lineage JAR as a dependency in the Spark job configuration. The JAR is available from Maven Central, and its Scala version must match the Scala version your Spark distribution was built with (2.12 or 2.13).

Key considerations:

Use spark.jars.packages configuration to add the dependency
Match the Scala version suffix to your Spark installation
The JAR is a shadow (fat) JAR that bundles all of its required dependencies
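As a rough illustration of the considerations above, the helper below assembles a value for spark.jars.packages. The Maven group id (io.acryl) and the underscore-suffixed artifact naming are assumptions not stated in this page, and the version string is a placeholder; check Maven Central for the real coordinate before using it.

```python
from typing import Dict

SUPPORTED_SCALA_VERSIONS = ("2.12", "2.13")


def lineage_package(scala_version: str, jar_version: str) -> str:
    """Build a Maven coordinate for spark.jars.packages.

    Assumes the coordinate has the form
    io.acryl:acryl-spark-lineage_<scala>:<version>.
    """
    if scala_version not in SUPPORTED_SCALA_VERSIONS:
        raise ValueError(
            f"Unsupported Scala version {scala_version!r}; "
            f"expected one of {SUPPORTED_SCALA_VERSIONS}"
        )
    return f"io.acryl:acryl-spark-lineage_{scala_version}:{jar_version}"


# Placeholder version; replace with an actual release number.
JAR_VERSION = "0.0.0"

conf: Dict[str, str] = {
    "spark.jars.packages": lineage_package("2.12", JAR_VERSION),
}
print(conf)
```

Validating the Scala suffix up front gives a clear error at configuration time instead of a class-loading failure once the job has started.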

Step 2: Register Spark Listener

Configure Spark to load the DataHub listener by adding its class name to the spark.extraListeners configuration property. The listener hooks into Spark's event system to intercept job lifecycle events.

Key considerations:

Set spark.extraListeners to datahub.spark.DatahubSparkListener
The listener automatically intercepts application start/end and query events
No code changes to Spark jobs are required
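Because spark.extraListeners takes a comma-separated list of class names, a job that already registers other listeners needs the DataHub class appended rather than overwritten. A minimal sketch of that merge follows; the helper name and the com.example.AuditListener class are hypothetical.

```python
from typing import Dict

DATAHUB_LISTENER = "datahub.spark.DatahubSparkListener"


def with_listener(conf: Dict[str, str], listener: str = DATAHUB_LISTENER) -> Dict[str, str]:
    """Return a copy of conf with listener added to spark.extraListeners.

    Existing listeners are preserved, and the listener is not added twice.
    """
    existing = [
        name.strip()
        for name in conf.get("spark.extraListeners", "").split(",")
        if name.strip()
    ]
    if listener not in existing:
        existing.append(listener)
    return {**conf, "spark.extraListeners": ",".join(existing)}


# com.example.AuditListener is a made-up listener that is already configured.
base_conf = {"spark.extraListeners": "com.example.AuditListener"}
merged = with_listener(base_conf)
print(merged["spark.extraListeners"])
```

Calling with_listener on an empty configuration simply sets the property to the DataHub listener class.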

Step 3: Configure DataHub Connection

Set the DataHub server endpoint and authentication token in Spark configuration properties. Choose the emission transport (REST, Kafka, or file).

Key considerations:

REST is the simplest transport: set spark.datahub.rest.server and spark.datahub.rest.token
Kafka requires broker and schema registry configuration
File emission writes metadata change proposals (MCPs) to disk for offline ingestion
Platform instance can be set for multi-environment support
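For the REST transport, the two properties named above can be assembled from the environment so that the token stays out of source control. This is a sketch under assumptions: the environment variable names DATAHUB_SERVER and DATAHUB_TOKEN and the http://localhost:8080 fallback are illustrative choices, not values defined by this workflow.

```python
import os
from typing import Dict, Optional


def rest_connection_conf(server: str, token: Optional[str] = None) -> Dict[str, str]:
    """Build the Spark properties for REST emission to DataHub.

    The token property is omitted when no token is supplied, for example
    against a server that has authentication disabled.
    """
    conf = {"spark.datahub.rest.server": server}
    if token:
        conf["spark.datahub.rest.token"] = token
    return conf


# Hypothetical environment variable names and fallback URL.
connection_conf = rest_connection_conf(
    server=os.environ.get("DATAHUB_SERVER", "http://localhost:8080"),
    token=os.environ.get("DATAHUB_TOKEN"),
)

# Print only the keys so the token value is never echoed to logs.
print(sorted(connection_conf))
```

Kafka and file emission would need their own properties, which this sketch does not attempt to cover.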

Step 4: Configure Lineage Settings

Customize dataset URN generation, platform mapping, and schema inclusion settings to match your data architecture.

Key considerations:

Path specifications map HDFS/S3 paths to meaningful dataset URNs
Platform mapping translates data source types to DataHub platform names
Schema inclusion controls whether column-level lineage is captured
Tags can be auto-generated from job configuration
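The idea behind path specifications can be illustrated with a deliberately simplified matcher: a pattern containing a {table} placeholder decides which prefix of a file path becomes the dataset name, so that many part files collapse into one dataset. The matching rules below are an assumption made for illustration and are not the integration's actual algorithm; the bucket and paths are hypothetical. Only the dataset URN layout follows DataHub's general pattern.

```python
from typing import Optional


def dataset_name_from_path(path_spec: str, path: str) -> Optional[str]:
    """Return the dataset name for path if it matches path_spec, else None.

    Simplified rules: the URI scheme is dropped, '*' matches any single
    segment, and the name ends at the segment matched by '{table}'.
    """
    spec_parts = path_spec.split("://", 1)[-1].split("/")
    path_parts = path.split("://", 1)[-1].split("/")
    name_parts = []
    for i, spec_part in enumerate(spec_parts):
        if i >= len(path_parts):
            return None  # path is shorter than the spec
        if spec_part == "{table}":
            if not path_parts[i]:
                return None  # empty segment cannot name a table
            name_parts.append(path_parts[i])
            return "/".join(name_parts)
        if spec_part != "*" and spec_part != path_parts[i]:
            return None  # literal segment does not match
        name_parts.append(path_parts[i])
    return None  # spec contains no {table} placeholder


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN in DataHub's general layout."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical bucket and paths.
spec = "s3://my-bucket/foo/tests/{table}/*"
name = dataset_name_from_path(spec, "s3://my-bucket/foo/tests/bar/part-0001.parquet")
print(dataset_urn("s3", name))
```

Without such a mapping, every distinct file or partition path could surface as its own dataset, which makes the lineage graph hard to read.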

Step 5: Run Spark Job

Execute the Spark job as normal. The listener automatically captures lineage events during execution and emits them to DataHub.

Key considerations:

The listener creates a DataFlow entity for the Spark application
Each query becomes a DataJob entity within the flow
Input and output datasets are linked as lineage relationships
Failed jobs are also tracked with failure information
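Since no job code changes are needed, running the job comes down to passing the configuration on the command line. The sketch below renders a configuration dictionary as spark-submit --conf arguments; the application file my_job.py and the server URL are hypothetical, and only properties named in this workflow are used.

```python
import shlex
from typing import Dict, List


def spark_submit_command(conf: Dict[str, str], app: str) -> List[str]:
    """Render conf as repeated --conf key=value arguments for spark-submit."""
    cmd = ["spark-submit"]
    for key in sorted(conf):  # sorted for a stable, reviewable command line
        cmd += ["--conf", f"{key}={conf[key]}"]
    cmd.append(app)
    return cmd


# Hypothetical server URL and application file.
job_conf = {
    "spark.extraListeners": "datahub.spark.DatahubSparkListener",
    "spark.datahub.rest.server": "http://localhost:8080",
}
command = spark_submit_command(job_conf, "my_job.py")

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Values passed with --conf are visible in process listings, so a token is usually better supplied through a properties file or a secrets mechanism than on the command line.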
