Principle:Datahub project Datahub Automatic Lineage Emission
Metadata
| Field | Value |
|---|---|
| principle_name | Automatic Lineage Emission |
| description | The process of automatically converting Spark job events into DataHub metadata change proposals and emitting them |
| type | Principle |
| status | Active |
| last_updated | 2026-02-10 |
| domains | Data_Lineage, Apache_Spark, Metadata_Management |
| repository | datahub-project/datahub |
Overview
Automatic Lineage Emission is the process of automatically converting Spark job events into DataHub Metadata Change Proposals (MCPs) and emitting them to the configured backend. This process happens transparently during Spark job execution when the DataHub listener is registered, requiring no application code changes.
Description
The automatic lineage emission process operates as a multi-stage pipeline that transforms Spark lifecycle events into DataHub metadata:
Stage 1: Event Capture
The OpenLineageSparkListener (delegated from DatahubSparkListener) intercepts Spark lifecycle events and parses Spark logical plans to produce OpenLineage RunEvent objects. Each RunEvent contains:
- Job information: namespace, name, facets (SQL, job type, ownership)
- Run information: run ID, facets (processing engine, Spark version, parent run)
- Input datasets: source tables/files with schema and column lineage facets
- Output datasets: target tables/files with schema and column lineage facets
- Event type: START, RUNNING, COMPLETE, FAIL, ABORT, OTHER
Stage 2: Conversion
The OpenLineageToDataHub converter transforms each RunEvent into a DatahubJob object, which contains:
- DataFlowUrn: Identifies the pipeline (orchestrator + flow name + namespace)
- DataJobUrn: Identifies the specific job within the pipeline
- DataFlowInfo: Flow-level metadata (name, custom properties, description)
- DataJobInfo: Job-level metadata (name, type, custom properties, timestamps)
- Input/Output DatahubDatasets: Dataset URNs with optional schema and lineage
- DataProcessInstance: Run-level metadata (run ID, status, timestamps, relationships)
Stage 3: Buffering
Each converted DatahubJob is added to an internal buffer (_datahubJobs list) within the DatahubEventEmitter. Multiple Spark jobs within a single application execution accumulate in this buffer.
Stage 4: Coalescing
When the Spark application ends (or periodically, if configured), buffered jobs are merged into a single coalesced DatahubJob. The coalescing process:
- Merges input and output dataset sets (union with schema/lineage updates)
- Merges custom properties from all jobs
- Tracks the minimum start time and maximum end time across all jobs
- Merges DataProcessInstance status (FAILURE overrides SUCCESS)
- Preserves the first DataProcessInstance URN for stable identity
- Applies configured tags and domains to the coalesced flow
Stage 5: MCP Generation
The coalesced DatahubJob generates MCPs for:
- DataFlow entity: DataFlowInfo, Ownership, GlobalTags, Domains, DataPlatformInstance
- DataJob entity: DataJobInfo, DataJobInputOutput
- Dataset entities: SchemaMetadata, UpstreamLineage (if materialize/schema enabled)
- DataProcessInstance entity: Properties, Relationships, RunEvent
Stage 6: Emission
MCPs are emitted to the configured backend (REST API, Kafka, local file, or S3). Each MCP is sent individually with a configurable timeout, and failures are logged without blocking subsequent emissions.
Theoretical Basis
This principle follows the event-driven pipeline pattern, where Spark lifecycle events are transformed through a multi-stage pipeline:
Spark Events -> OpenLineage RunEvents -> DatahubJob objects -> Buffer -> Coalesce -> MCPs -> Emit
Key design patterns:
- Event sourcing: All lineage information originates from Spark's event stream
- Buffering with batch flush: Events are accumulated and flushed as a batch at application end
- Idempotent emission: Coalesced MCPs can be re-emitted safely (same URNs produce same entities)
- Failure isolation: Individual MCP emission failures do not prevent other MCPs from being emitted
The coalescing strategy is particularly important because a single Spark application may execute many Spark jobs (one per action), and each job may reference overlapping sets of datasets. Without coalescing, the lineage graph would contain many redundant DataJob entities for what is logically a single pipeline execution.
Usage
Automatic lineage emission happens transparently during Spark job execution when the listener is registered. No additional configuration is required beyond the listener registration and connection settings.
The emission behavior can be controlled through configuration:
| Configuration | Effect |
|---|---|
spark.datahub.coalesce_jobs=true |
Merge all jobs into one at application end (default) |
spark.datahub.coalesce_jobs=false |
Emit each job independently as it completes |
spark.datahub.stage_metadata_coalescing=true |
Emit coalesced data periodically during execution |
spark.datahub.streaming_job=true |
Skip coalesced emission at application end (for streaming) |
spark.datahub.log.mcps=true |
Log serialized MCP JSON for debugging |
Emission Modes
| Mode | Trigger | Description |
|---|---|---|
| Coalesced (default) | Application end | All jobs merged and emitted once at application termination |
| Non-coalesced | Each job completion | Each job emitted independently when it completes |
| Periodic coalescing | Each job completion + application end | Coalesced data emitted after every job and at termination |
| Streaming | Heartbeat interval | Separate streaming emission logic with periodic heartbeats |
Knowledge Sources
- DataHub GitHub Repository
- OpenLineage Documentation
- OpenLineage RunEvent Specification
- Apache Spark SparkListener API
Related
- Implemented by: Datahub_project_Datahub_DatahubEventEmitter_Emit
Implementation:Datahub_project_Datahub_DatahubEventEmitter_Emit