Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Datahub project Datahub DatahubSparkListener Legacy

From Leeroopedia


Knowledge Sources
Domains Spark_Lineage
Last Updated 2026-02-10 00:00 GMT

Overview

DEPRECATED: This implementation is part of the legacy spark-lineage-legacy module and has been superseded by acryl-spark-lineage. See Heuristic:Datahub_project_Datahub_Warning_Deprecated_Spark_Lineage_Legacy for migration guidance.

A legacy Apache Spark SparkListener implementation that captures SQL query execution events, extracts dataset lineage from Spark logical plans, and emits metadata to DataHub via the McpEmitter or configurable LineageConsumer implementations.

Description

DatahubSparkListener (in the spark-lineage-legacy module) hooks into Spark's event system to capture metadata about data pipeline executions. It is designed to run as a Spark listener registered via spark.extraListeners.

Event Handling:

  • onApplicationStart() -- Initializes application tracking: parses Spark config, creates AppStartEvent, sets up McpEmitter (or CoalesceJobsEmitter if coalescing is enabled), and registers the app in internal maps.
  • onApplicationEnd() -- Emits AppEndEvent, closes emitters and consumers, and cleans up application state.
  • onOtherEvent() -- Intercepts SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd events to capture SQL query lifecycle.

SQL Execution Processing (inner class SqlStartTask):

  1. Retrieves the QueryExecution from Spark's SQLExecution context
  2. Extracts the optimized LogicalPlan
  3. Uses DatasetExtractor.asDataset() to identify output datasets from the plan
  4. Traverses the plan tree (including inner children) to identify input datasets
  5. Constructs a DatasetLineage with source and target datasets
  6. Emits SQLQueryExecStartEvent to the McpEmitter and any registered consumers

Spark Version Compatibility: The serializeSqlStartEvent() method handles both Spark 3.4+ (Jackson-based API via reflection) and older versions (json4s-based API) for serializing SQL execution events.

Configuration:

  • spark.datahub.lineage.consumerTypes -- Comma-separated list of LineageConsumer implementations
  • coalesce_jobs -- Boolean to enable job coalescing via CoalesceJobsEmitter
  • metadata.pipeline.platformInstance -- Platform instance prefix for pipeline names
  • databricks.cluster -- Databricks cluster identifier for naming

Usage

This is the legacy Spark lineage listener. It is configured by adding it to spark.extraListeners in Spark configuration. For new deployments, the OpenLineage-based Spark integration is preferred. This legacy listener is maintained for backward compatibility with existing Spark 2.x and early 3.x deployments.

Code Reference

Source Location

Signature

@Slf4j
public class DatahubSparkListener extends SparkListener {

    // Spark listener overrides
    @Override
    public void onApplicationStart(SparkListenerApplicationStart applicationStart);

    @Override
    public void onApplicationEnd(SparkListenerApplicationEnd applicationEnd);

    @Override
    public void onOtherEvent(SparkListenerEvent event);

    // Execution processing
    public void processExecutionEnd(SparkListenerSQLExecutionEnd sqlEnd);

    // Inner class for SQL start processing
    private class SqlStartTask {
        public SqlStartTask(SparkListenerSQLExecutionStart sqlStart, LogicalPlan plan, SparkContext ctx);
        public void run();
    }
}

Import

import datahub.spark.DatahubSparkListener;

I/O Contract

Input Type Description
SparkListenerApplicationStart Spark event Application lifecycle start event
SparkListenerApplicationEnd Spark event Application lifecycle end event
SparkListenerSQLExecutionStart Spark event SQL query execution start with plan
SparkListenerSQLExecutionEnd Spark event SQL query execution completion
Output Type Description
AppStartEvent / AppEndEvent DataHub events Application lifecycle events emitted to consumers
SQLQueryExecStartEvent DataHub event Contains DatasetLineage with source and target datasets
SQLQueryExecEndEvent DataHub event Completion event paired with corresponding start event

Usage Examples

// Spark configuration to enable the listener
// spark.extraListeners=datahub.spark.DatahubSparkListener
// spark.datahub.rest.server=http://localhost:8080
// spark.datahub.rest.token=<your-token>
// spark.datahub.coalesce_jobs=true

// The listener is automatically invoked by Spark's event system.
// No programmatic instantiation is needed.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment