
Principle:Datahub project Datahub Lineage Emission

From Leeroopedia


Page Type: Principle
Workflow: Spark_Lineage_Capture
Pair: 5 of 6
Implementation: Implementation:Datahub_project_Datahub_DatahubEventEmitter_Emit
Repository: https://github.com/datahub-project/datahub
Last Updated: 2026-02-09 17:00 GMT

Overview

Description

Lineage Emission is the principle of batched, coalesced emission of lineage metadata from a processing engine's event stream. Rather than emitting metadata for every individual Spark query execution independently, the DataHub Spark agent accumulates lineage information across all jobs within a Spark application and merges them into a single, comprehensive set of Metadata Change Proposals (MCPs) that are emitted at application end.

This accumulate-then-flush strategy addresses a fundamental challenge in Spark lineage capture: a single Spark application may execute dozens or hundreds of individual queries, each producing input/output dataset relationships. Emitting each independently would create many small, potentially overlapping DataJobs in DataHub. Instead, the coalescing mechanism merges all jobs into a single DataJob per application, combining all input datasets and output datasets into unified lineage edges.

Usage

The Lineage Emission principle governs how and when captured lineage metadata is delivered to DataHub. The behavior is controlled by several configuration properties:

  • Coalescing enabled (spark.datahub.coalesce_jobs, default true): When enabled, all jobs within a Spark application are merged into a single DataJob. Lineage is emitted at application end.
  • Coalescing disabled: Each OpenLineage RunEvent is immediately converted to MCPs and emitted via the configured transport. This produces one DataJob per Spark query execution.
  • Periodic coalescing (spark.datahub.stage_metadata_coalescing): On platforms like Databricks where the application end event is never fired, coalesced metadata is emitted periodically after each job completion.
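As a sketch, the coalescing behavior described above might be enabled at submit time like this (the pipeline script name is illustrative; the `spark.datahub.*` property names are the ones documented in this section):

```shell
# Enable job coalescing (the default) and periodic stage-level flushing,
# the latter for platforms such as Databricks where application end never fires.
spark-submit \
  --conf "spark.datahub.coalesce_jobs=true" \
  --conf "spark.datahub.stage_metadata_coalescing=true" \
  my_pipeline.py
```

Setting `spark.datahub.coalesce_jobs=false` instead would produce one DataJob per query execution, as noted above.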

The emission itself supports four transport mechanisms:

  • REST: MCPs are sent to the DataHub GMS server via HTTP REST API (default).
  • Kafka: MCPs are published to a Kafka MCP topic for asynchronous ingestion.
  • File: MCPs are written to a local file for offline processing.
  • S3: MCPs are written to an S3 bucket for cloud-native workflows.
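A minimal configuration sketch for selecting a transport follows; `spark.datahub.rest.server` matches the REST default described above, while the exact property names for the other transports should be checked against the agent's configuration reference:

```shell
# REST transport (default): point the agent at the DataHub GMS endpoint.
spark-submit \
  --conf "spark.datahub.rest.server=http://localhost:8080" \
  my_pipeline.py
```

Because transport selection is purely configuration, the same application can switch between REST, Kafka, file, or S3 delivery without any code changes.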

Each MCP is emitted with a 10-second timeout. The emitter handles failures by logging errors without crashing the Spark application, ensuring that metadata collection issues do not impact data processing.

Theoretical Basis

The Lineage Emission principle is grounded in the event coalescing pattern and the accumulate-then-flush strategy:

Event coalescing: When a system generates many fine-grained events that represent parts of a larger logical operation, it is often more efficient and semantically correct to merge them before acting on them. In the Spark context:

  • A Spark application reading from 5 tables and writing to 2 tables might execute 10+ individual Spark jobs internally.
  • Each job may have overlapping inputs (e.g., multiple joins on the same table).
  • Coalescing merges all these into a single DataJob with a unified set of input edges and output edges.

This reduces noise in the metadata graph and produces a lineage view that matches the user's mental model of "one Spark application = one logical data transformation."

Accumulate-then-flush: The emitter maintains an in-memory list of DatahubJob objects throughout the application's lifetime. At flush time (application end or periodic trigger), it merges all accumulated jobs by:

  • Combining input datasets across all jobs (deduplicating by URN).
  • Combining output datasets across all jobs (deduplicating by URN).
  • Taking the minimum start time and maximum end time.
  • Merging custom properties from all constituent jobs.
  • Selecting the most severe run result (failure overrides success).
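The merge rules above can be sketched in a few lines. `DatahubJob` here is a simplified stand-in for the agent's internal job record, not the actual Java class:

```python
from dataclasses import dataclass


@dataclass
class DatahubJob:
    """Simplified stand-in for one accumulated job's lineage record."""
    inputs: set       # input dataset URNs
    outputs: set      # output dataset URNs
    start_ms: int
    end_ms: int
    properties: dict
    result: str       # "SUCCESS" or "FAILURE"


def coalesce(jobs: list) -> DatahubJob:
    """Merge all accumulated jobs into one, following the rules above."""
    merged_inputs, merged_outputs, props = set(), set(), {}
    result = "SUCCESS"
    for job in jobs:
        merged_inputs |= job.inputs    # set union deduplicates by URN
        merged_outputs |= job.outputs
        props.update(job.properties)   # later jobs' properties win on key clash
        if job.result == "FAILURE":
            result = "FAILURE"         # most severe run result overrides success
    return DatahubJob(
        inputs=merged_inputs,
        outputs=merged_outputs,
        start_ms=min(j.start_ms for j in jobs),
        end_ms=max(j.end_ms for j in jobs),
        properties=props,
        result=result,
    )
```

The result is a single unified job whose time window spans all constituent jobs and whose lineage edges cover every dataset touched by the application.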

Transport abstraction: The emitter decouples the "what to emit" from the "how to emit" through a transport abstraction layer. The same set of MCPs can be delivered via REST, Kafka, file, or S3 without changing the coalescing or conversion logic. This follows the Strategy pattern where the transport mechanism is selected at configuration time and injected into the emitter.
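A minimal sketch of this Strategy arrangement, with hypothetical class names (the real agent implements the transports in Java):

```python
from typing import Protocol


class Transport(Protocol):
    """How to deliver one Metadata Change Proposal (MCP)."""
    def emit(self, mcp: dict) -> None: ...


class InMemoryTransport:
    """Stand-in for the file/S3 transports: records MCPs instead of sending."""
    def __init__(self) -> None:
        self.sent: list = []

    def emit(self, mcp: dict) -> None:
        self.sent.append(mcp)


class Emitter:
    """Holds the coalescing/conversion logic; delegates delivery to a transport
    that is injected at configuration time."""
    def __init__(self, transport: Transport) -> None:
        self.transport = transport

    def flush(self, mcps: list) -> None:
        for mcp in mcps:
            self.transport.emit(mcp)
```

Swapping REST for Kafka then means constructing the `Emitter` with a different `Transport`; the flush logic is untouched.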

Fault tolerance: The emission layer is designed to be non-blocking with respect to the Spark application. Each MCP emission uses a Future with a timeout, and failures are logged rather than propagated. This ensures that DataHub infrastructure outages or network issues do not cause Spark job failures.
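The future-with-timeout behavior can be sketched as follows; the function and parameter names are illustrative, not the agent's actual API:

```python
import concurrent.futures
import logging


def emit_all(mcps: list, send, timeout_s: float = 10.0) -> None:
    """Emit each MCP with a per-call timeout, logging failures instead of
    raising, so lineage problems never fail the Spark application itself."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        for mcp in mcps:
            future = pool.submit(send, mcp)
            try:
                future.result(timeout=timeout_s)  # mirrors the 10 s timeout above
            except Exception as exc:  # timeout or transport error
                logging.warning("MCP emission failed, continuing: %s", exc)
```

A GMS outage surfaces only as warnings in the Spark driver log; the application's own result is unaffected.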
