Implementation:Datahub project Datahub DatahubSparkListener Legacy
| Knowledge Sources | |
|---|---|
| Domains | Spark_Lineage |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
DEPRECATED: This implementation is part of the legacy spark-lineage-legacy module and has been superseded by acryl-spark-lineage. See Heuristic:Datahub_project_Datahub_Warning_Deprecated_Spark_Lineage_Legacy for migration guidance.
A legacy Apache Spark SparkListener implementation that captures SQL query execution events, extracts dataset lineage from Spark logical plans, and emits metadata to DataHub via the McpEmitter or configurable LineageConsumer implementations.
Description
DatahubSparkListener (in the spark-lineage-legacy module) hooks into Spark's event system to capture metadata about data pipeline executions. It is designed to run as a Spark listener registered via spark.extraListeners.
Event Handling:
onApplicationStart()-- Initializes application tracking: parses Spark config, createsAppStartEvent, sets upMcpEmitter(orCoalesceJobsEmitterif coalescing is enabled), and registers the app in internal maps.onApplicationEnd()-- EmitsAppEndEvent, closes emitters and consumers, and cleans up application state.onOtherEvent()-- InterceptsSparkListenerSQLExecutionStartandSparkListenerSQLExecutionEndevents to capture SQL query lifecycle.
SQL Execution Processing (inner class SqlStartTask):
- Retrieves the
QueryExecutionfrom Spark'sSQLExecutioncontext - Extracts the optimized
LogicalPlan - Uses
DatasetExtractor.asDataset()to identify output datasets from the plan - Traverses the plan tree (including inner children) to identify input datasets
- Constructs a
DatasetLineagewith source and target datasets - Emits
SQLQueryExecStartEventto theMcpEmitterand any registered consumers
Spark Version Compatibility:
The serializeSqlStartEvent() method handles both Spark 3.4+ (Jackson-based API via reflection) and older versions (json4s-based API) for serializing SQL execution events.
Configuration:
spark.datahub.lineage.consumerTypes-- Comma-separated list ofLineageConsumerimplementationscoalesce_jobs-- Boolean to enable job coalescing viaCoalesceJobsEmittermetadata.pipeline.platformInstance-- Platform instance prefix for pipeline namesdatabricks.cluster-- Databricks cluster identifier for naming
Usage
This is the legacy Spark lineage listener. It is configured by adding it to spark.extraListeners in Spark configuration. For new deployments, the OpenLineage-based Spark integration is preferred. This legacy listener is maintained for backward compatibility with existing Spark 2.x and early 3.x deployments.
Code Reference
Source Location
- Repository: Datahub_project_Datahub
- File: metadata-integration/java/spark-lineage-legacy/src/main/java/datahub/spark/DatahubSparkListener.java
Signature
@Slf4j
public class DatahubSparkListener extends SparkListener {
// Spark listener overrides
@Override
public void onApplicationStart(SparkListenerApplicationStart applicationStart);
@Override
public void onApplicationEnd(SparkListenerApplicationEnd applicationEnd);
@Override
public void onOtherEvent(SparkListenerEvent event);
// Execution processing
public void processExecutionEnd(SparkListenerSQLExecutionEnd sqlEnd);
// Inner class for SQL start processing
private class SqlStartTask {
public SqlStartTask(SparkListenerSQLExecutionStart sqlStart, LogicalPlan plan, SparkContext ctx);
public void run();
}
}
Import
import datahub.spark.DatahubSparkListener;
I/O Contract
| Input | Type | Description |
|---|---|---|
| SparkListenerApplicationStart | Spark event | Application lifecycle start event |
| SparkListenerApplicationEnd | Spark event | Application lifecycle end event |
| SparkListenerSQLExecutionStart | Spark event | SQL query execution start with plan |
| SparkListenerSQLExecutionEnd | Spark event | SQL query execution completion |
| Output | Type | Description |
|---|---|---|
| AppStartEvent / AppEndEvent | DataHub events | Application lifecycle events emitted to consumers |
| SQLQueryExecStartEvent | DataHub event | Contains DatasetLineage with source and target datasets
|
| SQLQueryExecEndEvent | DataHub event | Completion event paired with corresponding start event |
Usage Examples
// Spark configuration to enable the listener
// spark.extraListeners=datahub.spark.DatahubSparkListener
// spark.datahub.rest.server=http://localhost:8080
// spark.datahub.rest.token=<your-token>
// spark.datahub.coalesce_jobs=true
// The listener is automatically invoked by Spark's event system.
// No programmatic instantiation is needed.