Principle:Datahub project Datahub Spark Event Interception
| Attribute | Value |
|---|---|
| Page Type | Principle |
| Workflow | Spark_Lineage_Capture |
| Pair | 3 of 6 |
| Implementation | Implementation:Datahub_project_Datahub_DatahubSparkListener_Lifecycle |
| Repository | https://github.com/datahub-project/datahub |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
Spark Event Interception is the principle of hooking into a distributed processing engine's lifecycle events to capture metadata about data transformations as they occur. In DataHub's Spark integration, this is implemented through the DatahubSparkListener class, which extends Spark's native SparkListener interface to receive callbacks for application start/end and job start/end events.
The listener acts as an observer in the Spark runtime, receiving notifications at each stage of the Spark application lifecycle. When events are received, the listener delegates to two internal components: an OpenLineageSparkListener (which converts Spark execution plans into standardized OpenLineage events) and a DatahubEventEmitter (which converts those events into DataHub metadata and emits them to the configured transport).
This principle ensures that metadata capture is event-driven rather than poll-based, capturing lineage information at the exact moments when Spark jobs begin and complete execution.
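The coordination described above can be sketched as follows. This is an illustrative stand-in, not DataHub's actual code: the class and method names (`InterceptingListener`, `OpenLineageConverter`, `MetadataEmitter`, `on_event`) are hypothetical, chosen only to mirror the listener-delegates-to-converter-and-emitter shape.

```python
class OpenLineageConverter:
    """Stand-in for the OpenLineage listener role: turns raw engine
    events into standardized lineage events."""
    def convert(self, event):
        return {"openlineage": True, "source_event": event}

class MetadataEmitter:
    """Stand-in for the DataHub emitter role: ships converted events
    to a configured transport (here, just an in-memory list)."""
    def __init__(self):
        self.emitted = []
    def emit(self, lineage_event):
        self.emitted.append(lineage_event)

class InterceptingListener:
    """Thin coordinator: receives engine callbacks and delegates to
    the converter and the emitter, holding no lineage logic itself."""
    def __init__(self, converter, emitter):
        self.converter = converter
        self.emitter = emitter
    def on_event(self, event):
        # Event-driven: metadata leaves the process at the moment
        # the engine reports the lifecycle transition.
        self.emitter.emit(self.converter.convert(event))

# The engine would invoke on_event at each lifecycle transition.
listener = InterceptingListener(OpenLineageConverter(), MetadataEmitter())
listener.on_event({"type": "job_start", "job_id": 1})
```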
Usage
Spark Event Interception is activated by registering the listener class via the spark.extraListeners configuration property:
spark.extraListeners=datahub.spark.DatahubSparkListener
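In practice the property can be set per run on the spark-submit command line (or persisted in spark-defaults.conf). The invocation below is a sketch: the `io.acryl:acryl-spark-lineage` package coordinate and the `spark.datahub.rest.server` key follow DataHub's documented names, but the version placeholder, server URL, and script name are illustrative and should be checked against the current DataHub documentation.

```shell
# Register the listener for a single run; the lineage jar must be on the
# driver classpath. <version> is a placeholder, not a real coordinate.
spark-submit \
  --packages io.acryl:acryl-spark-lineage:<version> \
  --conf spark.extraListeners=datahub.spark.DatahubSparkListener \
  --conf spark.datahub.rest.server=http://localhost:8080 \
  my_job.py
```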
Once registered, the listener automatically receives callbacks for:
- Application start: Captures the application name, ID, attempt ID, Spark user, and start time. Initializes the OpenLineage context factory and the DataHub event emitter.
- Application end: Triggers the final coalesced emission of all accumulated lineage metadata (when coalescing is enabled). For non-streaming jobs, this is the primary emission point.
- Job start: Delegates to the OpenLineage listener to begin tracking the Spark execution plan for the job. Ensures the context factory is initialized.
- Job end: Delegates to the OpenLineage listener to finalize the job's lineage information.
- Task end: Captures task-level execution details.
- Other events: Handles SQL execution events and streaming micro-batch events through the OpenLineage delegation layer.
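The callback behaviors listed above can be condensed into a sketch. This is a simplified model under stated assumptions, not DataHub's implementation: `LifecycleCapturingListener` and its method names are hypothetical, and lineage is represented as plain dictionaries. It shows the key behavioral point from the list: with coalescing enabled, per-job lineage is accumulated and the application-end callback is the primary emission point.

```python
class LifecycleCapturingListener:
    """Hypothetical sketch of the callback pattern: initialize on
    application start, accumulate lineage per job, and emit the
    coalesced result on application end."""
    def __init__(self, coalesce=True):
        self.coalesce = coalesce
        self.initialized = False
        self.app = None
        self.pending = []   # lineage accumulated across jobs
        self.emitted = []   # what actually left the process
    def on_application_start(self, app_name, app_id):
        # Mirrors "initializes the context factory and the emitter".
        self.initialized = True
        self.app = (app_name, app_id)
    def on_job_start(self, job_id):
        # Mirrors "ensures the context factory is initialized".
        if not self.initialized:
            self.on_application_start("unknown", "unknown")
    def on_job_end(self, job_id, lineage):
        if self.coalesce:
            self.pending.append(lineage)   # hold until application end
        else:
            self.emitted.append(lineage)   # emit per job
    def on_application_end(self):
        # Primary emission point for non-streaming jobs when coalescing.
        if self.coalesce and self.pending:
            self.emitted.append({"app": self.app, "jobs": self.pending})
            self.pending = []
```

A run-through: nothing is emitted while jobs complete, then one coalesced event leaves at application end.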
Theoretical Basis
The Spark Event Interception principle is rooted in the Observer pattern from object-oriented design, applied to the domain of distributed processing engines. The key theoretical foundations are:
Observer pattern: The SparkListener interface defines a contract between the Spark runtime (the subject) and external observers. Observers register themselves and receive notifications about state changes without the subject needing to know the observer's specific implementation. This decouples the Spark runtime from metadata collection logic.
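The Observer relationship can be shown in a minimal, language-agnostic sketch (the names `Subject`, `MetadataObserver`, `add_listener`, and `fire` are illustrative, not Spark's API). The subject only knows that observers expose a callback; it never knows what they do with the notification, which is the decoupling the paragraph above describes.

```python
class Subject:
    """Plays the role of the Spark runtime: maintains a list of
    observers and notifies them of state changes."""
    def __init__(self):
        self._observers = []
    def add_listener(self, observer):
        self._observers.append(observer)
    def fire(self, event):
        # The subject is oblivious to observer implementations.
        for obs in self._observers:
            obs.on_event(event)

class MetadataObserver:
    """Plays the role of a lineage listener: records every
    notification it receives."""
    def __init__(self):
        self.seen = []
    def on_event(self, event):
        self.seen.append(event)

runtime = Subject()
collector = MetadataObserver()
runtime.add_listener(collector)
runtime.fire("application_start")
runtime.fire("job_start")
```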
Event-driven metadata collection: Rather than periodically scanning for metadata or parsing logs after execution completes, the listener captures metadata at the exact moment events occur. This provides several theoretical advantages:
- Temporal accuracy: Metadata timestamps correspond to actual execution times, not discovery times.
- Completeness: Every job lifecycle transition generates an event, so no executions are missed.
- Causality preservation: The order of events is preserved, enabling accurate reconstruction of execution sequences.
Delegation pattern: The DatahubSparkListener does not directly implement lineage extraction logic. Instead, it delegates to the OpenLineage library for plan analysis and to the DataHub emitter for metadata delivery. This separation of concerns means the listener itself remains a thin coordination layer, while complex logic resides in specialized components.
Lazy initialization: The listener defers context factory initialization until the first event is received, rather than initializing during construction. This pattern accounts for the fact that the full Spark environment (including SparkEnv and SparkContext) may not be available at listener construction time.
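The lazy-initialization pattern can be sketched as follows. The `LazyInitListener` class and `context_factory` callable are hypothetical illustrations of the idea, not DataHub's code: construction stores only a factory, and the expensive context is built on the first callback, when the environment it depends on is assumed to exist.

```python
class LazyInitListener:
    """Hypothetical sketch: defer building the context until the first
    event, because the environment the factory needs (e.g. a live
    runtime context) may not exist when the listener is constructed."""
    def __init__(self, context_factory):
        self._context_factory = context_factory
        self._context = None
    def _ensure_context(self):
        # Idempotent: the factory runs at most once.
        if self._context is None:
            self._context = self._context_factory()
        return self._context
    def on_event(self, event):
        ctx = self._ensure_context()   # first event triggers initialization
        ctx["events"].append(event)
        return ctx
```

The design choice here matches the rationale above: construction is cheap and safe at any time, while the costly, environment-dependent work happens only once the runtime is demonstrably up and delivering events.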