Workflow:Datahub project Datahub Spark Lineage Capture
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Lineage, Spark, Big_Data |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
End-to-end process for automatically capturing Apache Spark job lineage and emitting it to DataHub in real time using a Java listener agent.
Description
This workflow covers the Spark lineage integration, which hooks into Spark's event system via a custom listener to automatically capture data lineage. When a Spark job runs, the listener extracts input/output datasets from the query plan, creates DataFlow (pipeline) and DataJob (task) entities, and emits lineage relationships to DataHub. The integration supports REST, Kafka, and file-based emission, and works with Spark SQL, Structured Streaming, and Databricks. Column-level lineage with transformation type tracking is also supported.
Usage
Use this workflow when you run Apache Spark jobs and want to automatically capture which datasets each job reads and writes, producing end-to-end data lineage in DataHub. It applies to Spark on standalone clusters, EMR, Databricks, or any other Spark-compatible environment.
Execution Steps
Step 1: Add Lineage JAR Dependency
Include the acryl-spark-lineage JAR as a dependency in the Spark job configuration. The JAR is available from Maven Central and must match the Scala version your Spark distribution was built with (2.12 or 2.13).
Key considerations:
- Use spark.jars.packages configuration to add the dependency
- Match the Scala version suffix to your Spark installation
- The JAR bundles all required dependencies as a shadow JAR
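As a minimal sketch of the Scala-version matching described above, the helper below assembles a value for spark.jars.packages. The Maven group `io.acryl` and the `_<scala version>` artifact suffix are assumptions about how the artifact is published, and the version string is a placeholder; check Maven Central for the actual coordinate before using it.

```python
SUPPORTED_SCALA_VERSIONS = ("2.12", "2.13")


def lineage_package(scala_version: str, artifact_version: str) -> str:
    """Build a Maven coordinate for the lineage JAR.

    Assumption: the artifact is published under group "io.acryl" with the
    Scala version appended to the artifact name. Verify on Maven Central.
    """
    if scala_version not in SUPPORTED_SCALA_VERSIONS:
        raise ValueError(f"unsupported Scala version: {scala_version}")
    return f"io.acryl:acryl-spark-lineage_{scala_version}:{artifact_version}"


def packages_conf(scala_version: str, artifact_version: str) -> dict:
    """Return the Spark property that pulls the JAR in at submit time."""
    return {"spark.jars.packages": lineage_package(scala_version, artifact_version)}


if __name__ == "__main__":
    # "X.Y.Z" is a placeholder, not a real release number.
    print(packages_conf("2.12", "X.Y.Z"))
```

Rejecting unknown Scala versions up front is a design choice for the sketch: a mismatched suffix otherwise tends to surface later as a class-loading failure inside the job.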
Step 2: Register Spark Listener
Configure Spark to load the DataHub listener by adding its class name to the spark.extraListeners property. The listener hooks into Spark's event system to intercept job lifecycle events.
Key considerations:
- Set spark.extraListeners to datahub.spark.DatahubSparkListener
- The listener automatically intercepts application start/end and query events
- No code changes to Spark jobs are required
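Because spark.extraListeners takes a comma-separated list of class names, a job that already registers other listeners should append the DataHub listener rather than overwrite the property. The sketch below shows one way to do that on a plain dictionary of Spark properties; the helper name and the second listener class in the example are hypothetical.

```python
DATAHUB_LISTENER = "datahub.spark.DatahubSparkListener"


def with_datahub_listener(conf: dict) -> dict:
    """Return a copy of `conf` with the DataHub listener registered.

    Existing listeners are preserved, and the DataHub listener is added
    only once even if the function is applied repeatedly.
    """
    raw = conf.get("spark.extraListeners", "")
    listeners = [name.strip() for name in raw.split(",") if name.strip()]
    if DATAHUB_LISTENER not in listeners:
        listeners.append(DATAHUB_LISTENER)
    merged = dict(conf)
    merged["spark.extraListeners"] = ",".join(listeners)
    return merged


if __name__ == "__main__":
    # "com.example.AuditListener" is a made-up listener for illustration.
    print(with_datahub_listener({"spark.extraListeners": "com.example.AuditListener"}))
```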
Step 3: Configure DataHub Connection
Set the DataHub server endpoint and authentication token in Spark configuration properties. Choose the emission transport (REST, Kafka, or file).
Key considerations:
- REST is the simplest transport: set spark.datahub.rest.server and spark.datahub.rest.token
- Kafka requires broker and schema registry configuration
- File emission writes MCPs to disk for offline ingestion
- Platform instance can be set for multi-environment support
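For the REST transport, the two properties named above can be collected and rendered as `--conf key=value` arguments for spark-submit. This is a sketch under stated assumptions: the helper names are hypothetical, the server URL in the example is a placeholder, and the token is treated as optional for deployments without authentication.

```python
from typing import Optional


def rest_conf(server: str, token: Optional[str] = None) -> dict:
    """Collect the REST emitter properties for the DataHub listener."""
    if not server:
        raise ValueError("a DataHub server URL is required for REST emission")
    conf = {"spark.datahub.rest.server": server}
    if token:
        # Only set the token when one is provided, so that no empty
        # credential is passed on the command line.
        conf["spark.datahub.rest.token"] = token
    return conf


def to_submit_args(conf: dict) -> list:
    """Render properties as spark-submit arguments in a stable order."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # Placeholder endpoint; replace with your DataHub server address.
    print(to_submit_args(rest_conf("http://localhost:8080", token="<token>")))
```

In practice the token would usually come from a secret store or an environment variable rather than a literal, since command-line arguments can appear in process listings and logs.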
Step 4: Configure Lineage Settings
Customize dataset URN generation, platform mapping, and schema inclusion settings to match your data architecture.
Key considerations:
- Path specifications map HDFS/S3 paths to meaningful dataset URNs
- Platform mapping translates data source types to DataHub platform names
- Schema inclusion controls whether column-level lineage is captured
- Tags can be auto-generated from job configuration
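To illustrate what a path specification accomplishes, the sketch below collapses a file path to a table-level dataset URN. It is not the listener's implementation: the `{table}` placeholder syntax and the matching rule are assumptions made for the example, and only the general DataHub dataset URN shape, `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`, is taken as given.

```python
import re
from typing import Optional


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Format a DataHub dataset URN."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def match_path_spec(path: str, path_spec: str) -> Optional[str]:
    """Return the table-level prefix of `path`, or None if it does not match.

    Assumed syntax: `path_spec` is a literal prefix followed by "{table}",
    which stands for exactly one path segment.
    """
    head, sep, _ = path_spec.partition("{table}")
    if not sep:
        raise ValueError("path_spec must contain {table}")
    match = re.match(re.escape(head) + r"[^/]+", path)
    return match.group(0) if match else None


def path_to_urn(path: str, path_spec: str, platform: str = "s3",
                env: str = "PROD") -> Optional[str]:
    """Map a file path to a dataset URN via a path spec, if it matches."""
    table_path = match_path_spec(path, path_spec)
    if table_path is None:
        return None
    name = table_path.split("://", 1)[-1]  # drop the URI scheme
    return dataset_urn(platform, name, env)


if __name__ == "__main__":
    # Bucket and table names are made up for illustration.
    print(path_to_urn(
        "s3://my-bucket/warehouse/orders/dt=2024-01-01/part-0.parquet",
        "s3://my-bucket/warehouse/{table}",
    ))
```

The point of the mapping is that every partition and part file under the same table directory resolves to one dataset, so lineage is drawn between tables rather than between individual files.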
Step 5: Run Spark Job
Execute the Spark job as normal. The listener automatically captures lineage events during execution and emits them to DataHub.
Key considerations:
- The listener creates a DataFlow entity for the Spark application
- Individual queries create DataJob entities within the flow
- Input and output datasets are linked as lineage relationships
- Failed jobs are also tracked with failure information
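The entity structure described above can be pictured with a small model: one DataFlow URN per application, one DataJob URN per query nested under it, and dataset-to-dataset edges implied by each job's inputs and outputs. The URN shapes follow DataHub's general dataFlow and dataJob formats; the application name, job identifier, and dataset URNs in the example are hypothetical, and the identifiers the listener actually generates may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def data_flow_urn(app_name: str, cluster: str = "PROD",
                  orchestrator: str = "spark") -> str:
    """Format a DataFlow URN for a Spark application."""
    return f"urn:li:dataFlow:({orchestrator},{app_name},{cluster})"


def data_job_urn(flow_urn: str, job_id: str) -> str:
    """Format a DataJob URN nested under its parent DataFlow."""
    return f"urn:li:dataJob:({flow_urn},{job_id})"


@dataclass
class JobLineage:
    """Inputs and outputs recorded for one DataJob."""
    job_urn: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)

    def edges(self) -> List[Tuple[str, str]]:
        """Upstream-to-downstream dataset pairs implied by this job."""
        return [(src, dst) for dst in self.outputs for src in self.inputs]


if __name__ == "__main__":
    flow = data_flow_urn("nightly_orders_etl")  # hypothetical app name
    job = JobLineage(
        job_urn=data_job_urn(flow, "query_1"),  # hypothetical job id
        inputs=["urn:li:dataset:(urn:li:dataPlatform:s3,raw/orders,PROD)"],
        outputs=["urn:li:dataset:(urn:li:dataPlatform:s3,curated/orders,PROD)"],
    )
    print(job.job_urn)
    print(job.edges())
```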