Principle:Datahub project Datahub Spark Lineage Configuration
Metadata
| Field | Value |
|---|---|
| principle_name | Spark Lineage Configuration |
| description | The process of fine-tuning lineage capture behavior including dataset paths, platform mapping, schema inclusion, and coalescing options |
| type | Principle |
| status | Active |
| last_updated | 2026-02-10 |
| domains | Data_Lineage, Apache_Spark, Metadata_Management |
| repository | datahub-project/datahub |
Overview
Spark Lineage Configuration covers the settings that tune how the DataHub Spark lineage agent captures lineage: dataset paths, platform mapping, schema inclusion, and coalescing. The settings are supplied as Spark properties and collected into a structured configuration object, which determines what metadata is captured and how it is organized.
Description
Beyond connection settings, the DataHub Spark lineage agent provides extensive configuration for controlling what metadata is captured and how datasets are mapped to DataHub entities. These settings are consolidated into the SparkLineageConf object, which is constructed from the parsed Spark configuration and passed to all downstream components.
The lineage configuration covers several distinct areas:
Coalescing
By default, the lineage agent coalesces the Spark job events of a single application into one unified set of MetadataChangeProposals (MCPs). This reduces the number of metadata writes and produces a cleaner lineage graph. Coalescing is controlled by:
- coalesce_jobs (default: true) -- Enable or disable job coalescing
- stage_metadata_coalescing -- Emit coalesced metadata periodically during execution (essential for Databricks, which does not always fire application end events)
Column-Level Lineage
- captureColumnLevelLineage (default: true) -- Controls whether fine-grained column-level lineage is captured from Spark's query plans. When enabled, the agent extracts field-level transformation information from the OpenLineage ColumnLineageDatasetFacet.
Schema Metadata
- metadata.dataset.include_schema_metadata (default: false) -- When enabled, dataset schemas are captured from the OpenLineage SchemaDatasetFacet and emitted as DataHub SchemaMetadata aspects.
- metadata.dataset.materialize -- Materialize dataset entities in DataHub even if they do not already exist.
Dataset Environment and Platform
- metadata.dataset.env (default: PROD) -- The FabricType (environment) for emitted dataset URNs (PROD, DEV, STAGING, etc.)
- metadata.dataset.platformInstance -- Common platform instance to apply to all datasets
- metadata.dataset.hivePlatformAlias (default: hive) -- Platform name for Hive-symlinked datasets
- metadata.dataset.lowerCaseUrns -- Lowercase all dataset URN components
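To make the effect of these settings concrete, the following is a minimal sketch of how environment, platform instance, and lowercasing could shape a dataset URN. The helper `dataset_urn` is hypothetical and is not part of the agent. It follows the general DataHub URN layout, and it assumes that lowercasing applies to the name portion (including the instance prefix) while the environment stays unchanged.

```python
from typing import Optional


def dataset_urn(platform: str, name: str, env: str = "PROD",
                platform_instance: Optional[str] = None,
                lower_case: bool = False) -> str:
    """Hypothetical helper: build a DataHub-style dataset URN."""
    # A platform instance, when present, is prefixed to the dataset name.
    full_name = f"{platform_instance}.{name}" if platform_instance else name
    # Assumption of this sketch: lowercasing affects the name portion only.
    if lower_case:
        full_name = full_name.lower()
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{full_name},{env})"


# Defaults only: env falls back to PROD.
print(dataset_urn("hive", "sales.orders"))
# Environment, platform instance, and lowercasing applied together.
print(dataset_urn("hive", "Sales.Orders", env="DEV",
                  platform_instance="my_cluster", lower_case=True))
```

The two calls show how the same table name can produce different URNs, which is why these settings should stay consistent across every job that touches the same datasets.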
Path Spec Mapping
Path specs allow mapping filesystem paths to specific DataHub platforms, environments, and platform instances:
- platform.<name>.<alias>.path_spec_list -- Comma-separated list of path patterns
- platform.<name>.<alias>.env -- Environment override for matched paths
- platform.<name>.<alias>.platformInstance -- Platform instance override for matched paths
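The three properties above can be read as a small lookup table keyed by platform name and alias. The sketch below illustrates that idea only: the functions `parse_path_specs` and `resolve` are hypothetical, the platform `s3` and alias `lake` are made-up examples, and glob-style matching via `fnmatchcase` is an assumption. The agent's actual pattern syntax may differ.

```python
from fnmatch import fnmatchcase

PLATFORM_PREFIX = "spark.datahub.platform."


def parse_path_specs(conf):
    """Group platform.<name>.<alias>.* properties by (name, alias)."""
    specs = {}
    for key, value in conf.items():
        if not key.startswith(PLATFORM_PREFIX):
            continue
        parts = key[len(PLATFORM_PREFIX):].split(".", 2)
        if len(parts) != 3:
            continue  # not of the form <name>.<alias>.<option>
        name, alias, option = parts
        entry = specs.setdefault(
            (name, alias),
            {"patterns": [], "env": None, "platformInstance": None})
        if option == "path_spec_list":
            # Comma-separated list of path patterns.
            entry["patterns"] = [p.strip() for p in value.split(",") if p.strip()]
        elif option in ("env", "platformInstance"):
            entry[option] = value
    return specs


def resolve(path, specs):
    """Return the overrides of the first spec whose pattern matches, else None."""
    for (name, _alias), entry in specs.items():
        # Assumption: simple glob matching stands in for the real path spec rules.
        if any(fnmatchcase(path, pattern) for pattern in entry["patterns"]):
            return {"platform": name,
                    "env": entry["env"],
                    "platformInstance": entry["platformInstance"]}
    return None


conf = {
    "spark.datahub.platform.s3.lake.path_spec_list": "s3://my-bucket/raw/*, s3://my-bucket/curated/*",
    "spark.datahub.platform.s3.lake.env": "DEV",
    "spark.datahub.platform.s3.lake.platformInstance": "lake_instance",
}
specs = parse_path_specs(conf)
print(resolve("s3://my-bucket/raw/orders/part-0.parquet", specs))
print(resolve("s3://other-bucket/tmp/file.csv", specs))
```

A path that matches no pattern returns `None` here. In that case the sketch assumes the global dataset settings would apply.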
Tags and Domains
- tags -- Comma-separated list of tag names to apply to the DataFlow entity
- domains -- Comma-separated list of domain URNs to apply to the DataFlow entity
Patch Mode
- patch.enabled (default: false) -- Use patch-based MCP emission instead of full overwrites
Additional Options
- flow_name -- Override the DataFlow name (defaults to the Spark app name)
- streaming_job -- Flag to indicate a streaming Spark application
- streaming_heartbeat (default: 300 seconds) -- Heartbeat interval for streaming jobs
- log.mcps (default: true) -- Log serialized MCPs for debugging
- parent.datajob_urn -- URN of a parent DataJob for establishing job hierarchy
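As a rough illustration of how the documented defaults interact with user-supplied properties, the sketch below merges `spark.datahub.`-prefixed entries over a table of the defaults listed in this section. The names `DEFAULTS`, `effective_options`, and `as_bool` are hypothetical. The real agent parses these values in Java, and its parsing rules may differ.

```python
PREFIX = "spark.datahub."

# Defaults as documented in this section; options without a documented
# default are left out on purpose.
DEFAULTS = {
    "coalesce_jobs": "true",
    "captureColumnLevelLineage": "true",
    "metadata.dataset.include_schema_metadata": "false",
    "metadata.dataset.env": "PROD",
    "metadata.dataset.hivePlatformAlias": "hive",
    "patch.enabled": "false",
    "streaming_heartbeat": "300",  # seconds
    "log.mcps": "true",
}


def effective_options(spark_conf):
    """Overlay user-supplied spark.datahub.* properties on the defaults."""
    options = dict(DEFAULTS)
    for key, value in spark_conf.items():
        if key.startswith(PREFIX):
            options[key[len(PREFIX):]] = value
    return options


def as_bool(value):
    """Interpret a property string as a boolean (assumed rule: 'true' only)."""
    return value.strip().lower() == "true"


opts = effective_options({
    "spark.datahub.patch.enabled": "true",
    "spark.datahub.flow_name": "my_etl_pipeline",
    "spark.app.name": "ignored_by_this_sketch",
})
print(opts["patch.enabled"], opts["metadata.dataset.env"], opts["flow_name"])
```

Properties outside the `spark.datahub.` namespace are ignored, so unrelated Spark settings never leak into the lineage options.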
Theoretical Basis
This principle follows the configuration composition pattern, where lineage behavior is configured through a combination of Spark properties and a structured configuration object. The SparkLineageConf acts as a composed configuration that aggregates:
- Connection settings from DatahubEmitterConfig
- Lineage behavior from DatahubOpenlineageConfig
- Application context from SparkAppContext
- User preferences such as tags, domains, and coalescing options
The Builder pattern (via Lombok @Builder) is used to construct the SparkLineageConf. It exposes a fluent API for setting fields, and fields that are not set explicitly fall back to their defaults.
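The composition described above can be sketched outside Java as well. The following Python dataclasses mirror the idea of one configuration object aggregating smaller ones, with defaults standing in for the builder's unset fields. All class and field names here are illustrative assumptions and do not reproduce the real Java classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class EmitterConfig:
    """Stand-in for connection settings (hypothetical fields)."""
    server: str = "http://localhost:8080"


@dataclass(frozen=True)
class OpenLineageConfig:
    """Stand-in for lineage behavior (hypothetical fields)."""
    env: str = "PROD"
    include_schema_metadata: bool = False
    capture_column_level_lineage: bool = True


@dataclass(frozen=True)
class AppContext:
    """Stand-in for application context (hypothetical fields)."""
    app_name: str = "unknown"


@dataclass(frozen=True)
class LineageConf:
    """Composed configuration: three sub-configs plus user preferences."""
    emitter: EmitterConfig = field(default_factory=EmitterConfig)
    openlineage: OpenLineageConfig = field(default_factory=OpenLineageConfig)
    app: AppContext = field(default_factory=AppContext)
    tags: List[str] = field(default_factory=list)
    domains: List[str] = field(default_factory=list)
    coalesce_jobs: bool = True


# Only the fields of interest are set; everything else keeps its default.
conf = LineageConf(openlineage=OpenLineageConfig(env="DEV"), tags=["etl"])
print(conf.openlineage.env, conf.coalesce_jobs, conf.emitter.server)
```

Keyword arguments with defaults play the role that the fluent builder plays in Java: callers name only what they change, and downstream components receive a single immutable object.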
Usage
This principle applies when customizing which lineage information is captured and how datasets are mapped to DataHub entities.
# Option groups, in order: coalescing, column-level lineage, schema metadata,
# dataset environment, tags and domains, patch mode, flow name override.
# Comments are kept up here because a comment placed between
# backslash-continued lines would end the command early.
spark-submit \
  --conf "spark.datahub.rest.server=http://localhost:8080" \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.coalesce_jobs=true" \
  --conf "spark.datahub.stage_metadata_coalescing=true" \
  --conf "spark.datahub.captureColumnLevelLineage=true" \
  --conf "spark.datahub.metadata.dataset.include_schema_metadata=true" \
  --conf "spark.datahub.metadata.dataset.materialize=true" \
  --conf "spark.datahub.metadata.dataset.env=PROD" \
  --conf "spark.datahub.metadata.dataset.platformInstance=my_cluster" \
  --conf "spark.datahub.tags=etl,production,daily" \
  --conf "spark.datahub.domains=urn:li:domain:analytics,urn:li:domain:finance" \
  --conf "spark.datahub.patch.enabled=true" \
  --conf "spark.datahub.flow_name=my_etl_pipeline" \
  my_spark_app.py
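When the option list grows, it can help to assemble the `--conf` arguments programmatically so that each group carries its own comment. The sketch below is one way to do that. The helper `spark_submit_args` is hypothetical, and it only builds and prints the command rather than running it.

```python
import shlex
from typing import Dict, List


def spark_submit_args(app: str, server: str, options: Dict[str, str]) -> List[str]:
    """Build a spark-submit argument list from spark.datahub.* options."""
    args = [
        "spark-submit",
        "--conf", f"spark.datahub.rest.server={server}",
        "--conf", "spark.extraListeners=datahub.spark.DatahubSparkListener",
    ]
    for key, value in options.items():
        args += ["--conf", f"spark.datahub.{key}={value}"]
    args.append(app)
    return args


options = {
    # Coalescing
    "coalesce_jobs": "true",
    "stage_metadata_coalescing": "true",
    # Column-level lineage
    "captureColumnLevelLineage": "true",
    # Schema metadata
    "metadata.dataset.include_schema_metadata": "true",
    # Dataset environment
    "metadata.dataset.env": "PROD",
    # Tags
    "tags": "etl,production,daily",
    # Flow name override
    "flow_name": "my_etl_pipeline",
}

command = spark_submit_args("my_spark_app.py", "http://localhost:8080", options)
# shlex.join quotes each argument so the printed line can be pasted into a shell.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command)` would launch the job directly and avoid shell quoting issues altogether.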
Related
- Implemented by: Datahub_project_Datahub_SparkLineageConf_Builder