Heuristic: DataHub Spark Databricks Coalescing
| Knowledge Sources | |
|---|---|
| Domains | Spark, Lineage, Databricks |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Databricks clusters never fire Spark's `onApplicationEnd` event, so the DataHub Spark Lineage agent must be configured with `stage_metadata_coalescing=true` to emit lineage during intermediate job stages instead of at application shutdown.
Description
The DataHub Spark Lineage agent normally emits the final consolidated lineage metadata in the `onApplicationEnd` event handler. However, on Databricks clusters, this event is never triggered because Spark applications run continuously as long as the cluster is alive. Without the coalescing workaround, lineage metadata would never be emitted on Databricks. The `stage_metadata_coalescing` flag changes the emission strategy to emit lineage incrementally during intermediate stages (job end events) rather than waiting for application shutdown.
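As a minimal sketch, the flag is supplied like any other Spark property. The property names below are the ones used in this document; the GMS endpoint value is illustrative:

```
spark.datahub.stage_metadata_coalescing true
spark.datahub.rest.server http://<gms-host>:8080
```

On Databricks, these lines go in the cluster's Spark config (Advanced Options), so they apply before the long-lived SparkContext is created.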
Usage
Use this heuristic when deploying the DataHub Spark Lineage agent on Databricks clusters. It is also useful on AWS Glue or any environment where Spark applications run as long-lived services that may not terminate cleanly.
The Insight (Rule of Thumb)
- Action: Set `spark.datahub.stage_metadata_coalescing=true` in Spark configuration.
- Value: Boolean flag, default `false`.
- Trade-off: Slightly more frequent (but smaller) emissions to DataHub; lineage metadata arrives incrementally rather than as a single consolidated batch. This may result in more API calls to GMS but ensures data is never lost.
Reasoning
Databricks runs Spark as a service — the SparkContext is initialized when the cluster starts and only terminates when the cluster is shut down (which may be hours or days later). The standard `onApplicationEnd` listener callback is therefore unreliable or never called. By coalescing metadata at intermediate stages (after individual jobs complete), the agent ensures lineage is captured even if the cluster never shuts down cleanly.
This is also relevant for:
- Databricks Standard/High-concurrency clusters: These are long-lived shared clusters where application end is never triggered per-user.
- AWS Glue: Similar pattern where Spark sessions may not terminate cleanly.
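The effect of the flag can be illustrated with a small simulation (not DataHub code, just a sketch of the two emission strategies): when `onApplicationEnd` never fires, only the job-end emission path delivers any lineage.

```python
# Illustrative only: contrasts emit-on-application-end (default)
# with emit-on-job-end (stage_metadata_coalescing=true).

class LineageListener:
    def __init__(self, coalesce_on_job_end: bool):
        self.coalesce_on_job_end = coalesce_on_job_end
        self.buffer = []   # lineage collected but not yet sent
        self.emitted = []  # lineage actually delivered to DataHub

    def on_job_end(self, lineage):
        self.buffer.append(lineage)
        if self.coalesce_on_job_end:
            # Incremental strategy: smaller, more frequent emissions.
            self.emitted.extend(self.buffer)
            self.buffer.clear()

    def on_application_end(self):
        # On Databricks this callback is never invoked, because the
        # SparkContext lives as long as the cluster does.
        self.emitted.extend(self.buffer)
        self.buffer.clear()

# A long-lived cluster: jobs finish, but the application never ends.
default = LineageListener(coalesce_on_job_end=False)
coalescing = LineageListener(coalesce_on_job_end=True)
for job in ["job1", "job2"]:
    default.on_job_end(job)
    coalescing.on_job_end(job)

print(default.emitted)     # [] -- lineage is lost without the flag
print(coalescing.emitted)  # ['job1', 'job2']
```

The trade-off described above is visible here: the coalescing listener pays with more frequent emissions, but nothing is ever stranded in the buffer.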
Additional Databricks-specific tips:
- For MERGE INTO operations, enable `spark.datahub.metadata.dataset.enableEnhancedMergeIntoExtraction=true` for better table name resolution.
- Disable column-level lineage (`spark.datahub.captureColumnLevelLineage=false`) on large datasets for improved performance.
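Taken together, a Databricks cluster's lineage-related Spark config might look like the fragment below. Property names are as given in this document; whether you disable column-level lineage depends on your dataset sizes:

```
spark.datahub.stage_metadata_coalescing true
spark.datahub.metadata.dataset.enableEnhancedMergeIntoExtraction true
spark.datahub.captureColumnLevelLineage false
```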
Code Evidence
From `README.md` (Spark lineage):

```
spark.datahub.stage_metadata_coalescing=true
# Must be set on Databricks because onApplicationEnd is never called.
# Can also be enabled on Glue for coalesced runs.
```
From `DatahubSparkListener.java`:

```java
// The listener intercepts onApplicationStart, onApplicationEnd, onJobEnd.
// When stage_metadata_coalescing is true, lineage is emitted on each
// onJobEnd rather than waiting for onApplicationEnd.
```