Principle:Datahub project Datahub Automatic Lineage Emission

From Leeroopedia


Metadata

Field           Value
principle_name  Automatic Lineage Emission
description     The process of automatically converting Spark job events into DataHub metadata change proposals and emitting them
type            Principle
status          Active
last_updated    2026-02-10
domains         Data_Lineage, Apache_Spark, Metadata_Management
repository      datahub-project/datahub

Overview

Automatic Lineage Emission is the process of automatically converting Spark job events into DataHub Metadata Change Proposals (MCPs) and emitting them to the configured backend. This process happens transparently during Spark job execution when the DataHub listener is registered, requiring no application code changes.

Description

The automatic lineage emission process operates as a multi-stage pipeline that transforms Spark lifecycle events into DataHub metadata:

Stage 1: Event Capture

The OpenLineageSparkListener, to which DatahubSparkListener delegates, intercepts Spark lifecycle events and parses Spark logical plans to produce OpenLineage RunEvent objects. Each RunEvent contains:

  • Job information: namespace, name, facets (SQL, job type, ownership)
  • Run information: run ID, facets (processing engine, Spark version, parent run)
  • Input datasets: source tables/files with schema and column lineage facets
  • Output datasets: target tables/files with schema and column lineage facets
  • Event type: START, RUNNING, COMPLETE, FAIL, ABORT, OTHER
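To make the shape of a RunEvent concrete, the following is a minimal sketch that builds a plain dictionary with the fields listed above. The top-level keys follow the OpenLineage event layout (eventType, eventTime, run, job, inputs, outputs); the helper name make_run_event, the "default" namespace, and the sample job and table names are hypothetical, and facets are left empty for brevity.

```python
import uuid
from datetime import datetime, timezone

# Event types listed in Stage 1.
EVENT_TYPES = {"START", "RUNNING", "COMPLETE", "FAIL", "ABORT", "OTHER"}


def make_run_event(event_type, job_name, inputs, outputs, namespace="default"):
    """Build a dict shaped like an OpenLineage RunEvent (illustrative only)."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Run information: run ID plus facets (engine, Spark version, parent run).
        "run": {"runId": str(uuid.uuid4()), "facets": {}},
        # Job information: namespace, name, facets (SQL, job type, ownership).
        "job": {"namespace": namespace, "name": job_name, "facets": {}},
        # Datasets would carry schema and column lineage facets in a real event.
        "inputs": [{"namespace": namespace, "name": n, "facets": {}} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n, "facets": {}} for n in outputs],
    }


# Hypothetical job reading one table and writing another.
event = make_run_event(
    "COMPLETE", "daily_orders.insert_into_table", ["db.orders_raw"], ["db.orders_clean"]
)
```

In the real listener these objects are produced from Spark logical plans rather than constructed by hand; the sketch only shows which pieces of information travel together.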

Stage 2: Conversion

The OpenLineageToDataHub converter transforms each RunEvent into a DatahubJob object, which contains:

  • DataFlowUrn: Identifies the pipeline (orchestrator + flow name + namespace)
  • DataJobUrn: Identifies the specific job within the pipeline
  • DataFlowInfo: Flow-level metadata (name, custom properties, description)
  • DataJobInfo: Job-level metadata (name, type, custom properties, timestamps)
  • Input/Output DatahubDatasets: Dataset URNs with optional schema and lineage
  • DataProcessInstance: Run-level metadata (run ID, status, timestamps, relationships)
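The identifiers in this stage are DataHub URNs. As a rough illustration, the sketch below composes them with string formatting, following the commonly documented DataHub URN layouts. The function names and sample values are hypothetical, and mapping the source's "namespace" onto the third DataFlow component is an assumption; the actual converter may derive these parts differently.

```python
def dataflow_urn(orchestrator, flow_name, namespace):
    """URN for the pipeline: orchestrator + flow name + namespace."""
    return f"urn:li:dataFlow:({orchestrator},{flow_name},{namespace})"


def datajob_urn(flow_urn, job_name):
    """URN for a job; it embeds the URN of the flow it belongs to."""
    return f"urn:li:dataJob:({flow_urn},{job_name})"


def dataset_urn(platform, name, env="PROD"):
    """URN for an input or output dataset (env default is an assumption)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical values for a Spark application named "daily_orders".
flow = dataflow_urn("spark", "daily_orders", "default")
job = datajob_urn(flow, "daily_orders")
source = dataset_urn("hive", "db.orders_raw")
```

Because a DataJob URN nests its DataFlow URN, every job can be traced back to its pipeline from the identifier alone.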

Stage 3: Buffering

Each converted DatahubJob is added to an internal buffer (_datahubJobs list) within the DatahubEventEmitter. Multiple Spark jobs within a single application execution accumulate in this buffer.

Stage 4: Coalescing

When the Spark application ends (or periodically, if configured), buffered jobs are merged into a single coalesced DatahubJob. The coalescing process:

  • Merges input and output dataset sets (union with schema/lineage updates)
  • Merges custom properties from all jobs
  • Tracks the minimum start time and maximum end time across all jobs
  • Merges DataProcessInstance status (FAILURE overrides SUCCESS)
  • Preserves the first DataProcessInstance URN for stable identity
  • Applies configured tags and domains to the coalesced flow
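The merge rules above can be sketched as a small function over a buffer of jobs. This is a simplified model, not the DatahubEventEmitter code: JobSketch and coalesce are hypothetical names, datasets are reduced to URN strings, tag and domain handling is omitted, and resolving conflicting custom properties in favor of the later job is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class JobSketch:
    """Hypothetical, reduced stand-in for a DatahubJob."""
    inputs: Set[str] = field(default_factory=set)
    outputs: Set[str] = field(default_factory=set)
    custom_properties: Dict[str, str] = field(default_factory=dict)
    start_time: int = 0
    end_time: int = 0
    status: str = "SUCCESS"
    process_instance_urn: Optional[str] = None


def coalesce(jobs: List[JobSketch]) -> JobSketch:
    """Merge buffered jobs into one, following the rules listed above."""
    if not jobs:
        raise ValueError("nothing to coalesce")
    first = jobs[0]
    merged = JobSketch(
        start_time=first.start_time,
        end_time=first.end_time,
        # Keep the first DataProcessInstance URN for stable identity.
        process_instance_urn=first.process_instance_urn,
    )
    for job in jobs:
        merged.inputs |= job.inputs            # union of input datasets
        merged.outputs |= job.outputs          # union of output datasets
        merged.custom_properties.update(job.custom_properties)
        merged.start_time = min(merged.start_time, job.start_time)
        merged.end_time = max(merged.end_time, job.end_time)
        if job.status == "FAILURE":            # FAILURE overrides SUCCESS
            merged.status = "FAILURE"
    return merged


buffer = [
    JobSketch({"a"}, {"b"}, {"k1": "v1"}, 100, 200, "SUCCESS", "urn:run:1"),
    JobSketch({"b"}, {"c"}, {"k2": "v2"}, 150, 400, "FAILURE", "urn:run:2"),
]
result = coalesce(buffer)
```

With the two sample jobs, the merged job spans the earliest start to the latest end and is marked as failed because one constituent job failed.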

Stage 5: MCP Generation

The coalesced DatahubJob generates MCPs for:

  • DataFlow entity: DataFlowInfo, Ownership, GlobalTags, Domains, DataPlatformInstance
  • DataJob entity: DataJobInfo, DataJobInputOutput
  • Dataset entities: SchemaMetadata, UpstreamLineage (if materialize/schema enabled)
  • DataProcessInstance entity: Properties, Relationships, RunEvent
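One way to picture this stage is as a fan-out from a handful of URNs to one proposal per aspect. The sketch below is an assumption-laden illustration: the lowerCamelCase aspect names are my mapping of the names listed above onto DataHub aspect identifiers, generate_mcps and emit_dataset_aspects are hypothetical, and aspect payloads are left out entirely.

```python
# Aspects per entity type, mirroring the list above (names are assumed).
ASPECTS = {
    "dataFlow": ["dataFlowInfo", "ownership", "globalTags", "domains",
                 "dataPlatformInstance"],
    "dataJob": ["dataJobInfo", "dataJobInputOutput"],
    "dataset": ["schemaMetadata", "upstreamLineage"],
    "dataProcessInstance": ["dataProcessInstanceProperties",
                            "dataProcessInstanceRelationships",
                            "dataProcessInstanceRunEvent"],
}


def generate_mcps(flow_urn, job_urn, dataset_urns, run_urn,
                  emit_dataset_aspects=True):
    """Return one MCP-like dict per (entity, aspect) pair, without payloads."""
    targets = [("dataFlow", flow_urn), ("dataJob", job_urn)]
    if emit_dataset_aspects:  # stands in for the materialize/schema settings
        targets += [("dataset", urn) for urn in dataset_urns]
    targets.append(("dataProcessInstance", run_urn))
    return [
        {"entityType": entity_type, "entityUrn": urn,
         "changeType": "UPSERT", "aspectName": aspect}
        for entity_type, urn in targets
        for aspect in ASPECTS[entity_type]
    ]


mcps = generate_mcps("urn:flow", "urn:job", ["urn:ds1", "urn:ds2"], "urn:run")
```

Each proposal targets a single aspect of a single entity, which is what allows the emission stage to send and fail them independently.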

Stage 6: Emission

MCPs are emitted to the configured backend (REST API, Kafka, local file, or S3). Each MCP is sent individually with a configurable timeout, and failures are logged without blocking subsequent emissions.
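The per-MCP failure handling described here can be sketched as a loop that catches and logs each error before moving on. The names emit_all and send, the default timeout value, and the flaky_send stub are hypothetical; a real emitter would wrap a REST, Kafka, file, or S3 client.

```python
import logging

logger = logging.getLogger("lineage.emit.sketch")


def emit_all(mcps, send, timeout_sec=10.0):
    """Send each MCP individually; log failures and keep going.

    `send` is any callable taking (mcp, timeout_sec) that raises on error.
    Returns the MCPs that could not be emitted.
    """
    failed = []
    for mcp in mcps:
        try:
            send(mcp, timeout_sec)
        except Exception as exc:  # isolate the failure to this one MCP
            logger.warning("failed to emit %s: %s", mcp, exc)
            failed.append(mcp)
    return failed


# Stub backend that rejects one specific proposal.
sent = []

def flaky_send(mcp, timeout_sec):
    if mcp == "mcp-2":
        raise TimeoutError(f"no response within {timeout_sec}s")
    sent.append(mcp)


failed = emit_all(["mcp-1", "mcp-2", "mcp-3"], flaky_send, timeout_sec=5.0)
```

In this run the second proposal times out, yet the third is still delivered, which is the isolation property the stage relies on.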

Theoretical Basis

This principle follows an event-driven pipeline pattern in which Spark lifecycle events pass through a fixed sequence of transformations:

Spark Events -> OpenLineage RunEvents -> DatahubJob objects -> Buffer -> Coalesce -> MCPs -> Emit

Key design patterns:

  • Event sourcing: All lineage information originates from Spark's event stream
  • Buffering with batch flush: Events are accumulated and flushed as a batch at application end
  • Idempotent emission: Coalesced MCPs can be re-emitted safely (same URNs produce same entities)
  • Failure isolation: Individual MCP emission failures do not prevent other MCPs from being emitted

The coalescing strategy is particularly important because a single Spark application may execute many Spark jobs (one per action), and each job may reference overlapping sets of datasets. Without coalescing, the lineage graph would contain many redundant DataJob entities for what is logically a single pipeline execution.

Usage

Automatic lineage emission happens transparently during Spark job execution when the listener is registered. No additional configuration is required beyond the listener registration and connection settings.

The emission behavior can be controlled through configuration:

Configuration                                  Effect
spark.datahub.coalesce_jobs=true               Merge all jobs into one at application end (default)
spark.datahub.coalesce_jobs=false              Emit each job independently as it completes
spark.datahub.stage_metadata_coalescing=true   Emit coalesced data periodically during execution
spark.datahub.streaming_job=true               Skip coalesced emission at application end (for streaming)
spark.datahub.log.mcps=true                    Log serialized MCP JSON for debugging
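As an illustration of how these settings might be passed to a job, the sketch below renders a configuration dictionary as spark-submit arguments. The listener class name (datahub.spark.DatahubSparkListener) and the spark.datahub.rest.server key are not stated on this page; they are assumptions based on the DataHub Spark integration documentation and should be verified against the version in use. The helper name and the server URL are hypothetical.

```python
def spark_submit_conf_args(conf):
    """Render a settings dict as a flat list of `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = {
    # Assumed listener registration and connection settings (verify these).
    "spark.extraListeners": "datahub.spark.DatahubSparkListener",
    "spark.datahub.rest.server": "http://localhost:8080",
    # Settings from the table above.
    "spark.datahub.coalesce_jobs": "true",
    "spark.datahub.log.mcps": "true",
}

args = spark_submit_conf_args(conf)
# The list can be spliced into a command such as:
#   ["spark-submit", *args, "my_job.py"]
```

Sorting the keys keeps the rendered command stable between runs, which makes it easier to compare configurations when debugging emission behavior.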

Emission Modes

Mode                 Trigger                                Description
Coalesced (default)  Application end                        All jobs merged and emitted once at application termination
Non-coalesced        Each job completion                    Each job emitted independently when it completes
Periodic coalescing  Each job completion + application end  Coalesced data emitted after every job and at termination
Streaming            Heartbeat interval                     Separate streaming emission logic with periodic heartbeats
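Read together, the two tables suggest a mapping from configuration flags to emission triggers. The following sketch encodes one possible reading; the function names, the trigger labels, the "false" defaults for the streaming and periodic flags, and the order in which the flags are checked are all assumptions rather than documented behavior.

```python
def _flag(conf, key, default):
    """Interpret a Spark-style string setting as a boolean."""
    return str(conf.get(key, default)).lower() == "true"


def emission_triggers(conf):
    """Return the set of triggers at which metadata would be emitted."""
    if _flag(conf, "spark.datahub.streaming_job", "false"):
        return {"heartbeat"}                    # streaming mode
    if not _flag(conf, "spark.datahub.coalesce_jobs", "true"):
        return {"job_end"}                      # non-coalesced mode
    if _flag(conf, "spark.datahub.stage_metadata_coalescing", "false"):
        return {"job_end", "application_end"}   # periodic coalescing
    return {"application_end"}                  # coalesced (default)


default_mode = emission_triggers({})
periodic_mode = emission_triggers(
    {"spark.datahub.stage_metadata_coalescing": "true"}
)
```

Under this reading, an empty configuration yields a single emission at application end, matching the default row of the table.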

Related

Implementation:Datahub_project_Datahub_DatahubEventEmitter_Emit