Workflow:Datahub project Datahub Spark Lineage Capture

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Lineage, Spark, Big_Data
Last Updated 2026-02-09 12:00 GMT

Overview

End-to-end process for automatically capturing Apache Spark job lineage and emitting it to DataHub in real time using a Java listener agent.

Description

This workflow covers the Spark lineage integration, which hooks into Spark's event system via a custom listener to automatically capture data lineage. When a Spark job runs, the listener extracts input/output datasets from the query plan, creates DataFlow (pipeline) and DataJob (task) entities, and emits lineage relationships to DataHub. The integration supports REST, Kafka, and file-based emission, and works with Spark SQL, Structured Streaming, and Databricks. Column-level lineage with transformation type tracking is also supported.

Usage

Execute this workflow when you run Apache Spark jobs and want to automatically capture which datasets are read and written, producing end-to-end data lineage in DataHub. This is suitable for Spark running on standalone clusters, EMR, Databricks, or any Spark-compatible environment.

Execution Steps

Step 1: Add Lineage JAR Dependency

Include the acryl-spark-lineage JAR as a dependency in the Spark job configuration. The JAR is available from Maven Central, and its Scala version suffix (2.12 or 2.13) must match the Scala version your Spark distribution was built with.

Key considerations:

  • Use the spark.jars.packages configuration property to add the dependency
  • Match the Scala version suffix to your Spark installation
  • The JAR bundles all required dependencies as a shadow JAR
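The considerations above can be sketched as a small helper that builds the Maven coordinate and appends it to spark.jars.packages. This is a minimal illustration, not part of the integration itself: the io.acryl group ID and the `_<scala version>` artifact suffix are assumptions about how the artifact is published, the helper names are hypothetical, and the agent version is left to the caller.

```python
from typing import Dict

SUPPORTED_SCALA = ("2.12", "2.13")


def lineage_package(scala_version: str, agent_version: str) -> str:
    """Build the Maven coordinate for the lineage JAR (group ID is an assumption)."""
    if scala_version not in SUPPORTED_SCALA:
        raise ValueError(f"unsupported Scala version: {scala_version}")
    return f"io.acryl:acryl-spark-lineage_{scala_version}:{agent_version}"


def add_package(conf: Dict[str, str], coordinate: str) -> Dict[str, str]:
    """Return a copy of conf with the coordinate appended to spark.jars.packages."""
    updated = dict(conf)
    raw = conf.get("spark.jars.packages", "")
    # spark.jars.packages is a comma-separated list of Maven coordinates.
    packages = [p.strip() for p in raw.split(",") if p.strip()]
    if coordinate not in packages:
        packages.append(coordinate)
    updated["spark.jars.packages"] = ",".join(packages)
    return updated
```

Keeping any packages that are already configured matters because setting the property a second time would otherwise replace the earlier value.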

Step 2: Register Spark Listener

Configure Spark to load the DataHub listener by adding its class name to the spark.extraListeners property. The listener hooks into Spark's event system to intercept job lifecycle events.

Key considerations:

  • Set spark.extraListeners to datahub.spark.DatahubSparkListener
  • The listener automatically intercepts application start/end and query events
  • No code changes to Spark jobs are required
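As a rough sketch of the registration step, the helper below adds the listener class to spark.extraListeners without dropping listeners that are already configured. The function name is hypothetical, and the configuration is modeled as a plain dictionary rather than a live Spark session.

```python
from typing import Dict

DATAHUB_LISTENER = "datahub.spark.DatahubSparkListener"


def with_datahub_listener(conf: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of conf with the DataHub listener registered exactly once."""
    updated = dict(conf)
    raw = conf.get("spark.extraListeners", "")
    # spark.extraListeners takes a comma-separated list of listener class names.
    listeners = [name.strip() for name in raw.split(",") if name.strip()]
    if DATAHUB_LISTENER not in listeners:
        listeners.append(DATAHUB_LISTENER)
    updated["spark.extraListeners"] = ",".join(listeners)
    return updated
```

The resulting dictionary can be passed to whatever mechanism the job already uses for configuration, for example `--conf` flags or spark-defaults.conf entries.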

Step 3: Configure DataHub Connection

Set the DataHub server endpoint and authentication token in Spark configuration properties. Choose the emission transport (REST, Kafka, or file).

Key considerations:

  • REST is the simplest: spark.datahub.rest.server and spark.datahub.rest.token
  • Kafka requires broker and schema registry configuration
  • File emission writes MCPs to disk for offline ingestion
  • Platform instance can be set for multi-environment support
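For the REST transport, the two properties named above are enough for a minimal setup. The sketch below builds them and renders `--conf` arguments for spark-submit. The helper names and the DATAHUB_TOKEN environment variable are hypothetical, and Kafka and file emission are left out because their property names are not given here.

```python
import os
from typing import Dict, List, Optional


def rest_conf(server: str, token: Optional[str] = None) -> Dict[str, str]:
    """Build the REST emitter properties; the token is omitted when not provided."""
    conf = {"spark.datahub.rest.server": server}
    if token:
        conf["spark.datahub.rest.token"] = token
    return conf


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a configuration dictionary as spark-submit --conf arguments."""
    args: List[str] = []
    for key in sorted(conf):  # sorted for a stable, reviewable command line
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # DATAHUB_TOKEN is a hypothetical variable name; the server URL is a placeholder.
    conf = rest_conf("http://localhost:8080", os.environ.get("DATAHUB_TOKEN"))
    print(" ".join(to_submit_args(conf)))
```

Reading the token from the environment keeps it out of version-controlled job definitions, although it will still be visible in the process arguments if passed through `--conf`.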

Step 4: Configure Lineage Settings

Customize dataset URN generation, platform mapping, and schema inclusion settings to match your data architecture.

Key considerations:

  • Path specifications map HDFS/S3 paths to meaningful dataset URNs
  • Platform mapping translates data source types to DataHub platform names
  • Schema inclusion controls whether column-level lineage is captured
  • Tags can be auto-generated from job configuration
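To make the path specification idea concrete, the sketch below maps a file path to a dataset URN using a pattern with a `{table}` marker. It is a simplified model written for illustration, not the listener's own matching logic: the function names are hypothetical, `*` is assumed to match exactly one path segment, and the dataset name is assumed to be the path up to and including the `{table}` segment with the URI scheme removed.

```python
from typing import Optional


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Format a DataHub dataset URN for a platform, dataset name, and environment."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def apply_path_spec(path: str, spec: str) -> Optional[str]:
    """Return the dataset name for path under spec, or None if it does not match."""
    # Drop the scheme (for example s3a:// or hdfs://) before comparing segments.
    path_parts = path.split("://", 1)[-1].split("/")
    spec_parts = spec.split("://", 1)[-1].split("/")
    if len(path_parts) < len(spec_parts):
        return None
    for index, (want, got) in enumerate(zip(spec_parts, path_parts)):
        if want == "{table}":
            # Everything up to and including this segment names the dataset.
            return "/".join(path_parts[: index + 1])
        if want != "*" and want != got:
            return None
    return None  # the spec has no {table} marker


if __name__ == "__main__":
    name = apply_path_spec(
        "s3a://lake/warehouse/orders/dt=2024-01-01/part-0.parquet",
        "s3a://lake/warehouse/{table}/*/*",
    )
    print(dataset_urn("s3", name))
    # prints urn:li:dataset:(urn:li:dataPlatform:s3,lake/warehouse/orders,PROD)
```

Without such a mapping, every partition file or dated folder would tend to appear as its own dataset, which fragments the lineage graph.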

Step 5: Run Spark Job

Execute the Spark job as normal. The listener automatically captures lineage events during execution and emits them to DataHub.

Key considerations:

  • The listener creates a DataFlow entity for the Spark application
  • Individual queries create DataJob entities within the flow
  • Input and output datasets are linked as lineage relationships
  • Failed jobs are also tracked with failure information
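The entity relationships listed above can be pictured with a small model: one DataFlow per application, one DataJob per query, and input and output dataset URNs attached to each job. The URN layouts follow DataHub's general dataFlow and dataJob formats, but the job IDs, the "default" cluster value, and the status strings are hypothetical placeholders; the listener's actual naming may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def data_flow_urn(app_name: str, cluster: str = "default") -> str:
    """URN for the pipeline entity that represents the Spark application."""
    return f"urn:li:dataFlow:(spark,{app_name},{cluster})"


def data_job_urn(flow_urn: str, job_id: str) -> str:
    """URN for a task entity nested under a flow."""
    return f"urn:li:dataJob:({flow_urn},{job_id})"


@dataclass
class QueryLineage:
    job_id: str  # hypothetical identifier for one query
    inputs: List[str] = field(default_factory=list)   # dataset URNs read
    outputs: List[str] = field(default_factory=list)  # dataset URNs written
    succeeded: bool = True


def summarize(app_name: str, queries: List[QueryLineage]) -> Tuple[str, List[Dict]]:
    """Return the flow URN and one record per query, including failed ones."""
    flow = data_flow_urn(app_name)
    jobs = [
        {
            "job": data_job_urn(flow, q.job_id),
            "inputs": list(q.inputs),
            "outputs": list(q.outputs),
            "status": "SUCCESS" if q.succeeded else "FAILURE",
        }
        for q in queries
    ]
    return flow, jobs
```

A failed query still produces a job record in this model, mirroring the point that failures are tracked rather than silently dropped.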
