Principle:Datahub project Datahub Automatic Lineage Emission

Metadata

Field	Value
principle_name	Automatic Lineage Emission
description	The process of automatically converting Spark job events into DataHub metadata change proposals and emitting them
type	Principle
status	Active
last_updated	2026-02-10
domains	Data_Lineage, Apache_Spark, Metadata_Management
repository	datahub-project/datahub

Overview

Automatic Lineage Emission is the process of automatically converting Spark job events into DataHub Metadata Change Proposals (MCPs) and emitting them to the configured backend. This process happens transparently during Spark job execution when the DataHub listener is registered, requiring no application code changes.

Description

The automatic lineage emission process operates as a multi-stage pipeline that transforms Spark lifecycle events into DataHub metadata:

Stage 1: Event Capture

The OpenLineageSparkListener (delegated from DatahubSparkListener) intercepts Spark lifecycle events and parses Spark logical plans to produce OpenLineage RunEvent objects. Each RunEvent contains:

Job information: namespace, name, facets (SQL, job type, ownership)
Run information: run ID, facets (processing engine, Spark version, parent run)
Input datasets: source tables/files with schema and column lineage facets
Output datasets: target tables/files with schema and column lineage facets
Event type: START, RUNNING, COMPLETE, FAIL, ABORT, OTHER

Stage 2: Conversion

The OpenLineageToDataHub converter transforms each RunEvent into a DatahubJob object, which contains:

DataFlowUrn: Identifies the pipeline (orchestrator + flow name + namespace)
DataJobUrn: Identifies the specific job within the pipeline
DataFlowInfo: Flow-level metadata (name, custom properties, description)
DataJobInfo: Job-level metadata (name, type, custom properties, timestamps)
Input/Output DatahubDatasets: Dataset URNs with optional schema and lineage
DataProcessInstance: Run-level metadata (run ID, status, timestamps, relationships)

Stage 3: Buffering

Each converted DatahubJob is added to an internal buffer (_datahubJobs list) within the DatahubEventEmitter. Multiple Spark jobs within a single application execution accumulate in this buffer.

Stage 4: Coalescing

When the Spark application ends (or periodically, if configured), buffered jobs are merged into a single coalesced DatahubJob. The coalescing process:

Merges input and output dataset sets (union with schema/lineage updates)
Merges custom properties from all jobs
Tracks the minimum start time and maximum end time across all jobs
Merges DataProcessInstance status (FAILURE overrides SUCCESS)
Preserves the first DataProcessInstance URN for stable identity
Applies configured tags and domains to the coalesced flow

Stage 5: MCP Generation

The coalesced DatahubJob generates MCPs for:

DataFlow entity: DataFlowInfo, Ownership, GlobalTags, Domains, DataPlatformInstance
DataJob entity: DataJobInfo, DataJobInputOutput
Dataset entities: SchemaMetadata, UpstreamLineage (if materialize/schema enabled)
DataProcessInstance entity: Properties, Relationships, RunEvent

Stage 6: Emission

MCPs are emitted to the configured backend (REST API, Kafka, local file, or S3). Each MCP is sent individually with a configurable timeout, and failures are logged without blocking subsequent emissions.

Theoretical Basis

This principle follows the event-driven pipeline pattern, where Spark lifecycle events are transformed through a multi-stage pipeline:

Spark Events -> OpenLineage RunEvents -> DatahubJob objects -> Buffer -> Coalesce -> MCPs -> Emit

Key design patterns:

Event sourcing: All lineage information originates from Spark's event stream
Buffering with batch flush: Events are accumulated and flushed as a batch at application end
Idempotent emission: Coalesced MCPs can be re-emitted safely (same URNs produce same entities)
Failure isolation: Individual MCP emission failures do not prevent other MCPs from being emitted

The coalescing strategy is particularly important because a single Spark application may execute many Spark jobs (one per action), and each job may reference overlapping sets of datasets. Without coalescing, the lineage graph would contain many redundant DataJob entities for what is logically a single pipeline execution.

Usage

Automatic lineage emission happens transparently during Spark job execution when the listener is registered. No additional configuration is required beyond the listener registration and connection settings.

The emission behavior can be controlled through configuration:

Configuration	Effect
`spark.datahub.coalesce_jobs=true`	Merge all jobs into one at application end (default)
`spark.datahub.coalesce_jobs=false`	Emit each job independently as it completes
`spark.datahub.stage_metadata_coalescing=true`	Emit coalesced data periodically during execution
`spark.datahub.streaming_job=true`	Skip coalesced emission at application end (for streaming)
`spark.datahub.log.mcps=true`	Log serialized MCP JSON for debugging

Emission Modes

Mode	Trigger	Description
Coalesced (default)	Application end	All jobs merged and emitted once at application termination
Non-coalesced	Each job completion	Each job emitted independently when it completes
Periodic coalescing	Each job completion + application end	Coalesced data emitted after every job and at termination
Streaming	Heartbeat interval	Separate streaming emission logic with periodic heartbeats

Knowledge Sources

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment