Principle:Datahub project Datahub Spark Lineage Configuration

Metadata

Field	Value
principle_name	Spark Lineage Configuration
description	The process of fine-tuning lineage capture behavior including dataset paths, platform mapping, schema inclusion, and coalescing options
type	Principle
status	Active
last_updated	2026-02-10
domains	Data_Lineage, Apache_Spark, Metadata_Management
repository	datahub-project/datahub

Overview

Spark Lineage Configuration is the process of tuning what the DataHub Spark lineage agent captures and how it organizes that metadata: dataset paths, platform mapping, schema inclusion, and coalescing options. These choices are expressed as Spark properties, which the agent gathers into a structured configuration object.

Description

Beyond connection settings, the DataHub Spark lineage agent provides extensive configuration for controlling what metadata is captured and how datasets are mapped to DataHub entities. These settings are consolidated into the SparkLineageConf object, which is constructed from the parsed Spark configuration and passed to all downstream components.

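The lineage configuration covers several distinct areas. Property names below are written without the spark.datahub. prefix they carry when set as Spark properties; for example, coalesce_jobs is set as spark.datahub.coalesce_jobs.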

Coalescing

By default, the lineage agent coalesces multiple Spark job events within a single application into a unified set of MCPs (Metadata Change Proposals). This reduces the number of metadata writes and produces a cleaner lineage graph. Coalescing is controlled by two settings:

coalesce_jobs (default: true) -- Enable or disable job coalescing
stage_metadata_coalescing -- Emit coalesced metadata periodically during execution (essential for Databricks, which does not always fire application end events)
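
As a minimal sketch of how these two switches might be chosen per environment, the helper below returns full Spark property names, using the spark.datahub. prefix shown in the Usage section. The function name coalescing_conf and the on_databricks flag are hypothetical and are not part of the agent.

```python
from typing import Dict


def coalescing_conf(on_databricks: bool) -> Dict[str, str]:
    """Return Spark properties for job coalescing (hypothetical helper)."""
    # coalesce_jobs already defaults to true; it is set explicitly for visibility.
    conf = {"spark.datahub.coalesce_jobs": "true"}
    if on_databricks:
        # Emit coalesced metadata during execution, because the application
        # end event is not always fired on Databricks.
        conf["spark.datahub.stage_metadata_coalescing"] = "true"
    return conf


if __name__ == "__main__":
    for key, value in coalescing_conf(on_databricks=True).items():
        print(f"--conf {key}={value}")
```

Keeping the decision in one place makes it harder to forget the periodic-emission setting on clusters where the end-of-application event is unreliable.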

Column-Level Lineage

captureColumnLevelLineage (default: true) -- Controls whether fine-grained column-level lineage is captured from Spark's query plans. When enabled, the agent extracts field-level transformation information from OpenLineage ColumnLineageDatasetFacet.

Schema Metadata

metadata.dataset.include_schema_metadata (default: false) -- When enabled, dataset schemas are captured from OpenLineage SchemaDatasetFacet and emitted as DataHub SchemaMetadata aspects.
metadata.dataset.materialize -- Materialize dataset entities in DataHub even if they do not already exist.

Dataset Environment and Platform

metadata.dataset.env (default: PROD) -- The FabricType (environment) for emitted dataset URNs (PROD, DEV, STG, etc.)
metadata.dataset.platformInstance -- Common platform instance to apply to all datasets
metadata.dataset.hivePlatformAlias (default: hive) -- Platform name for Hive-symlinked datasets
metadata.dataset.lowerCaseUrns -- Lowercase all dataset URN components

Path Spec Mapping

Path specs allow mapping filesystem paths to specific DataHub platforms, environments, and platform instances:

platform.<name>.<alias>.path_spec_list -- Comma-separated list of path patterns
platform.<name>.<alias>.env -- Environment override for matched paths
platform.<name>.<alias>.platformInstance -- Platform instance override for matched paths
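
The key layout above can be sketched as a small helper that assembles the three properties for one platform and alias. The helper name path_spec_conf, the alias landing, the bucket name, and the pattern string are placeholders chosen for illustration; the exact pattern syntax the agent accepts is not described here and should be checked against the agent's documentation.

```python
from typing import Dict, List, Optional

PREFIX = "spark.datahub."


def path_spec_conf(
    platform: str,
    alias: str,
    path_specs: List[str],
    env: Optional[str] = None,
    platform_instance: Optional[str] = None,
) -> Dict[str, str]:
    """Build the platform.<name>.<alias>.* properties for one path spec group."""
    base = f"{PREFIX}platform.{platform}.{alias}."
    # path_spec_list is a comma-separated list of path patterns.
    conf = {base + "path_spec_list": ",".join(path_specs)}
    if env is not None:
        conf[base + "env"] = env
    if platform_instance is not None:
        conf[base + "platformInstance"] = platform_instance
    return conf


if __name__ == "__main__":
    # Placeholder values: bucket, alias, and pattern are assumptions.
    example = path_spec_conf(
        "s3", "landing", ["s3://example-bucket/landing/*"], env="DEV"
    )
    for key, value in example.items():
        print(f"{key}={value}")
```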

Tags and Domains

tags -- Comma-separated list of tag names to apply to the DataFlow entity
domains -- Comma-separated list of domain URNs to apply to the DataFlow entity
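
Because both values are comma-separated, it can help to build them from lists and reject entries that would break the format. The sketch below is a hypothetical pre-flight check, not something the agent performs; it assumes domain URNs start with urn:li:domain:, as in the Usage example.

```python
from typing import Dict, Iterable

DOMAIN_URN_PREFIX = "urn:li:domain:"


def tags_and_domains_conf(tags: Iterable[str], domains: Iterable[str]) -> Dict[str, str]:
    """Join tag names and domain URNs into comma-separated property values."""
    tag_list, domain_list = list(tags), list(domains)
    for item in tag_list + domain_list:
        # A comma inside an entry would be read as a separator.
        if "," in item:
            raise ValueError(f"entry must not contain a comma: {item!r}")
    for urn in domain_list:
        if not urn.startswith(DOMAIN_URN_PREFIX):
            raise ValueError(f"expected a domain URN, got: {urn!r}")
    return {
        "spark.datahub.tags": ",".join(tag_list),
        "spark.datahub.domains": ",".join(domain_list),
    }
```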

Patch Mode

patch.enabled (default: false) -- Use patch-based MCP emission instead of full overwrites

Additional Options

flow_name -- Override the DataFlow name (defaults to the Spark app name)
streaming_job -- Flag to indicate a streaming Spark application
streaming_heartbeat (default: 300 seconds) -- Heartbeat interval for streaming jobs
log.mcps (default: true) -- Log serialized MCPs for debugging
parent.datajob_urn -- URN of a parent DataJob for establishing job hierarchy
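
For a streaming application, the relevant options might be assembled as in the sketch below. It assumes that streaming_job takes a true/false value and that streaming_heartbeat is a whole number of seconds, which matches the default listed above; the helper name streaming_conf is hypothetical.

```python
from typing import Dict, Optional


def streaming_conf(heartbeat_seconds: int = 300, flow_name: Optional[str] = None) -> Dict[str, str]:
    """Return Spark properties for a streaming job (hypothetical helper)."""
    if heartbeat_seconds <= 0:
        raise ValueError("heartbeat_seconds must be positive")
    conf = {
        "spark.datahub.streaming_job": "true",
        "spark.datahub.streaming_heartbeat": str(heartbeat_seconds),
    }
    if flow_name is not None:
        # Overrides the DataFlow name, which otherwise is the Spark app name.
        conf["spark.datahub.flow_name"] = flow_name
    return conf
```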

Theoretical Basis

This principle follows the configuration composition pattern, where lineage behavior is configured through a combination of Spark properties and a structured configuration object. The SparkLineageConf acts as a composed configuration that aggregates:

Connection settings from DatahubEmitterConfig
Lineage behavior from DatahubOpenlineageConfig
Application context from SparkAppContext
User preferences such as tags, domains, and coalescing options

The Builder pattern (via Lombok @Builder) is used to construct the SparkLineageConf, so callers set only the fields they need through a fluent API and the remaining fields keep their default values.
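
The composition can be illustrated with a rough Python analogy using dataclasses. This is not the agent's Java code: the class names are stand-ins for the classes named above, and the field names are assumptions chosen to mirror the settings and defaults listed in the Description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class EmitterConfig:
    """Stand-in for DatahubEmitterConfig (connection settings)."""
    server: str


@dataclass(frozen=True)
class OpenlineageConfig:
    """Stand-in for DatahubOpenlineageConfig (lineage behavior)."""
    capture_column_level_lineage: bool = True
    include_schema_metadata: bool = False
    env: str = "PROD"


@dataclass(frozen=True)
class AppContext:
    """Stand-in for SparkAppContext (application context)."""
    app_name: str


@dataclass(frozen=True)
class LineageConf:
    """Stand-in for SparkLineageConf: one object composed of the parts above."""
    emitter: EmitterConfig
    app_context: AppContext
    openlineage: OpenlineageConfig = field(default_factory=OpenlineageConfig)
    coalesce_jobs: bool = True
    tags: List[str] = field(default_factory=list)
    domains: List[str] = field(default_factory=list)


if __name__ == "__main__":
    conf = LineageConf(
        emitter=EmitterConfig(server="http://localhost:8080"),
        app_context=AppContext(app_name="my_etl_pipeline"),
        tags=["etl", "daily"],
    )
    print(conf)
```

Keyword arguments with defaults play the role that the generated builder plays in Java: only the fields that differ from the defaults need to be named at the call site.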

Usage

This principle applies when customizing which lineage information is captured and how datasets are mapped to DataHub entities.

# Options are collected in a bash array so that each group can carry a comment.
# A comment line placed between backslash-continued lines would end the command early.
conf=(
  --conf "spark.datahub.rest.server=http://localhost:8080"
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener"
  # Coalescing
  --conf "spark.datahub.coalesce_jobs=true"
  --conf "spark.datahub.stage_metadata_coalescing=true"
  # Column-level lineage
  --conf "spark.datahub.captureColumnLevelLineage=true"
  # Schema metadata
  --conf "spark.datahub.metadata.dataset.include_schema_metadata=true"
  --conf "spark.datahub.metadata.dataset.materialize=true"
  # Dataset environment
  --conf "spark.datahub.metadata.dataset.env=PROD"
  --conf "spark.datahub.metadata.dataset.platformInstance=my_cluster"
  # Tags and domains
  --conf "spark.datahub.tags=etl,production,daily"
  --conf "spark.datahub.domains=urn:li:domain:analytics,urn:li:domain:finance"
  # Patch mode
  --conf "spark.datahub.patch.enabled=true"
  # Flow name override
  --conf "spark.datahub.flow_name=my_etl_pipeline"
)
spark-submit "${conf[@]}" my_spark_app.py
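
When the command is assembled programmatically, for instance by a scheduler, the same properties can be kept in a mapping and rendered into an argument list. The sketch below uses only the Python standard library; spark_submit_argv is a hypothetical helper, and the mapping holds a subset of the properties shown above.

```python
import shlex
from typing import Dict, List


def spark_submit_argv(conf: Dict[str, str], app: str) -> List[str]:
    """Render a property mapping as a spark-submit argument list."""
    argv = ["spark-submit"]
    for key, value in conf.items():
        # Each property becomes its own --conf key=value pair.
        argv += ["--conf", f"{key}={value}"]
    argv.append(app)
    return argv


if __name__ == "__main__":
    conf = {
        "spark.datahub.rest.server": "http://localhost:8080",
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        "spark.datahub.coalesce_jobs": "true",
        "spark.datahub.tags": "etl,production,daily",
    }
    # shlex.join quotes arguments where needed, so the line can be pasted into a shell.
    print(shlex.join(spark_submit_argv(conf, "my_spark_app.py")))
```

The returned list can also be passed directly to subprocess.run, which avoids shell quoting altogether.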
