Principle:Datahub project Datahub Spark Lineage Configuration

From Leeroopedia


Metadata

Field Value
principle_name Spark Lineage Configuration
description The process of fine-tuning lineage capture behavior including dataset paths, platform mapping, schema inclusion, and coalescing options
type Principle
status Active
last_updated 2026-02-10
domains Data_Lineage, Apache_Spark, Metadata_Management
repository datahub-project/datahub

Overview

Spark Lineage Configuration governs how the DataHub Spark lineage agent captures metadata: how dataset paths are resolved, how platforms are mapped, whether schemas are included, and how job events are coalesced. The settings are supplied as Spark properties and assembled into a structured configuration object that the rest of the agent reads.

Description

Beyond connection settings, the DataHub Spark lineage agent provides extensive configuration for controlling what metadata is captured and how datasets are mapped to DataHub entities. These settings are consolidated into the SparkLineageConf object, which is constructed from the parsed Spark configuration and passed to all downstream components.

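The lineage configuration covers several distinct areas. Property names in the lists below omit the spark.datahub. prefix they carry when set as Spark properties (for example, coalesce_jobs is set as spark.datahub.coalesce_jobs).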

Coalescing

By default, the lineage agent coalesces multiple Spark job events within a single application into a unified set of metadata change proposals (MCPs). This reduces the number of metadata writes and produces a cleaner lineage graph. Coalescing is controlled by two options:

  • coalesce_jobs (default: true) -- Enable or disable job coalescing
  • stage_metadata_coalescing -- Emit coalesced metadata periodically during execution (essential for Databricks, which does not always fire application end events)

Column-Level Lineage

  • captureColumnLevelLineage (default: true) -- Controls whether column-level lineage is captured from Spark's query plans. When enabled, the agent extracts field-level transformation information from the OpenLineage ColumnLineageDatasetFacet.

Schema Metadata

  • metadata.dataset.include_schema_metadata (default: false) -- When enabled, dataset schemas are read from the OpenLineage SchemaDatasetFacet and emitted as DataHub SchemaMetadata aspects.
  • metadata.dataset.materialize -- Materialize dataset entities in DataHub even if they do not already exist.

Dataset Environment and Platform

  • metadata.dataset.env (default: PROD) -- The FabricType (environment) for emitted dataset URNs (PROD, DEV, STAGING, etc.)
  • metadata.dataset.platformInstance -- Common platform instance to apply to all datasets
  • metadata.dataset.hivePlatformAlias (default: hive) -- Platform name for Hive-symlinked datasets
  • metadata.dataset.lowerCaseUrns -- Lowercase all dataset URN components
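
To make the effect of these options concrete, the following is a minimal sketch of how a dataset URN string could be assembled from them. It is not the agent's code: the function name dataset_urn is hypothetical, and two details are assumptions made for illustration, namely that the platform instance is prefixed to the dataset name with a dot and that lowercasing applies to the name portion only.

```python
from typing import Optional


def dataset_urn(
    platform: str,
    name: str,
    env: str = "PROD",                       # mirrors metadata.dataset.env
    platform_instance: Optional[str] = None,  # mirrors metadata.dataset.platformInstance
    lower_case: bool = False,                # mirrors metadata.dataset.lowerCaseUrns
) -> str:
    """Build a DataHub-style dataset URN string (illustrative only)."""
    # Assumption: the platform instance is prepended to the name with a dot.
    full_name = f"{platform_instance}.{name}" if platform_instance else name
    # Assumption: only the name portion is lowercased; env stays upper case.
    if lower_case:
        full_name = full_name.lower()
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{full_name},{env})"


if __name__ == "__main__":
    print(dataset_urn("hive", "Sales.Orders", env="DEV",
                      platform_instance="my_cluster", lower_case=True))
```

Under those assumptions, the call in the example prints urn:li:dataset:(urn:li:dataPlatform:hive,my_cluster.sales.orders,DEV), which shows why changing env or platformInstance after lineage has been emitted yields different dataset entities.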

Path Spec Mapping

Path specs allow mapping filesystem paths to specific DataHub platforms, environments, and platform instances:

  • platform.<name>.<alias>.path_spec_list -- Comma-separated list of path patterns
  • platform.<name>.<alias>.env -- Environment override for matched paths
  • platform.<name>.<alias>.platformInstance -- Platform instance override for matched paths
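
The three keys above share a common prefix, so they can be generated together. The sketch below builds them for one platform and alias; the helper name path_spec_conf, the alias landing, the bucket name, and the {table} placeholder in the pattern are all hypothetical, and the spark.datahub. prefix is assumed by analogy with the other properties on this page. The pattern syntax itself is not specified here, so the path strings are placeholders only.

```python
from typing import Dict, Iterable, Optional


def path_spec_conf(
    platform: str,
    alias: str,
    path_specs: Iterable[str],
    env: Optional[str] = None,
    platform_instance: Optional[str] = None,
) -> Dict[str, str]:
    """Return Spark properties for one platform/alias path-spec group."""
    # Assumption: path-spec keys carry the same spark.datahub. prefix as
    # the other lineage properties.
    prefix = f"spark.datahub.platform.{platform}.{alias}"
    conf = {f"{prefix}.path_spec_list": ",".join(path_specs)}
    # The overrides are optional; omit the key entirely when not given.
    if env is not None:
        conf[f"{prefix}.env"] = env
    if platform_instance is not None:
        conf[f"{prefix}.platformInstance"] = platform_instance
    return conf


if __name__ == "__main__":
    conf = path_spec_conf(
        "s3",
        "landing",  # hypothetical alias
        ["s3a://example-bucket/landing/{table}"],  # hypothetical pattern
        env="DEV",
    )
    for key, value in conf.items():
        print(f'--conf "{key}={value}"')
```

Generating the keys from one helper keeps the platform and alias consistent across the list, environment, and instance entries, which is easy to get wrong when the properties are typed by hand.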

Tags and Domains

  • tags -- Comma-separated list of tag names to apply to the DataFlow entity
  • domains -- Comma-separated list of domain URNs to apply to the DataFlow entity

Patch Mode

  • patch.enabled (default: false) -- Use patch-based MCP emission instead of full overwrites

Additional Options

  • flow_name -- Override the DataFlow name (defaults to the Spark app name)
  • streaming_job -- Flag to indicate a streaming Spark application
  • streaming_heartbeat (default: 300 seconds) -- Heartbeat interval for streaming jobs
  • log.mcps (default: true) -- Log serialized MCPs for debugging
  • parent.datajob_urn -- URN of a parent DataJob for establishing job hierarchy

Theoretical Basis

This principle follows the configuration composition pattern, where lineage behavior is configured through a combination of Spark properties and a structured configuration object. The SparkLineageConf acts as a composed configuration that aggregates:

  1. Connection settings from DatahubEmitterConfig
  2. Lineage behavior from DatahubOpenlineageConfig
  3. Application context from SparkAppContext
  4. User preferences such as tags, domains, and coalescing options

The Builder pattern (via Lombok @Builder) is used to construct the SparkLineageConf, so that fields are set through a fluent API and options left unset fall back to their defaults.
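
The composition idea can be illustrated outside Java. The sketch below is a Python analogue, not a port of the real classes: EmitterSettings, LineageBehavior, LineageConf, and build_lineage_conf are hypothetical stand-ins, and only a handful of the options described earlier are modelled. Defaults are taken from the lists above.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

PREFIX = "spark.datahub."


def _flag(conf: Mapping[str, str], key: str, default: bool) -> bool:
    """Read a boolean property, falling back to the documented default."""
    return conf.get(PREFIX + key, str(default)).strip().lower() == "true"


def _csv(conf: Mapping[str, str], key: str) -> Tuple[str, ...]:
    """Split a comma-separated property into a tuple of trimmed items."""
    raw = conf.get(PREFIX + key, "")
    return tuple(item.strip() for item in raw.split(",") if item.strip())


@dataclass(frozen=True)
class EmitterSettings:  # stand-in for the connection settings
    rest_server: Optional[str] = None


@dataclass(frozen=True)
class LineageBehavior:  # stand-in for the lineage behavior settings
    capture_column_level_lineage: bool = True
    include_schema_metadata: bool = False
    env: str = "PROD"
    patch_enabled: bool = False


@dataclass(frozen=True)
class LineageConf:  # composed object handed to downstream components
    emitter: EmitterSettings
    behavior: LineageBehavior
    flow_name: str
    coalesce_jobs: bool = True
    tags: Tuple[str, ...] = ()
    domains: Tuple[str, ...] = ()


def build_lineage_conf(conf: Mapping[str, str], app_name: str) -> LineageConf:
    """Compose one immutable config from flat Spark properties."""
    return LineageConf(
        emitter=EmitterSettings(rest_server=conf.get(PREFIX + "rest.server")),
        behavior=LineageBehavior(
            capture_column_level_lineage=_flag(conf, "captureColumnLevelLineage", True),
            include_schema_metadata=_flag(
                conf, "metadata.dataset.include_schema_metadata", False),
            env=conf.get(PREFIX + "metadata.dataset.env", "PROD"),
            patch_enabled=_flag(conf, "patch.enabled", False),
        ),
        # flow_name falls back to the Spark application name.
        flow_name=conf.get(PREFIX + "flow_name", app_name),
        coalesce_jobs=_flag(conf, "coalesce_jobs", True),
        tags=_csv(conf, "tags"),
        domains=_csv(conf, "domains"),
    )
```

Parsing the flat properties once and passing a single immutable object downstream means no component needs to know the property names or their defaults, which is the main benefit of the composition described above.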

Usage

This principle applies when customizing which lineage information is captured and how datasets are mapped to DataHub entities.

# A comment cannot follow a line-continuation backslash, so the options are
# collected in a bash array, where comments between elements are allowed.
conf=(
  --conf "spark.datahub.rest.server=http://localhost:8080"
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener"
  # Coalescing
  --conf "spark.datahub.coalesce_jobs=true"
  --conf "spark.datahub.stage_metadata_coalescing=true"
  # Column-level lineage
  --conf "spark.datahub.captureColumnLevelLineage=true"
  # Schema metadata
  --conf "spark.datahub.metadata.dataset.include_schema_metadata=true"
  --conf "spark.datahub.metadata.dataset.materialize=true"
  # Dataset environment
  --conf "spark.datahub.metadata.dataset.env=PROD"
  --conf "spark.datahub.metadata.dataset.platformInstance=my_cluster"
  # Tags and domains
  --conf "spark.datahub.tags=etl,production,daily"
  --conf "spark.datahub.domains=urn:li:domain:analytics,urn:li:domain:finance"
  # Patch mode
  --conf "spark.datahub.patch.enabled=true"
  # Flow name override
  --conf "spark.datahub.flow_name=my_etl_pipeline"
)
spark-submit "${conf[@]}" my_spark_app.py
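
When the option set varies between jobs, the same command can be assembled programmatically instead of being edited by hand. The following is a minimal sketch using only the Python standard library; the function name spark_submit_args is hypothetical, and it assumes every lineage option takes the spark.datahub. prefix as in the command above.

```python
import shlex
from typing import Dict, List, Mapping, Optional, Union

OptionValue = Union[str, bool, int]


def spark_submit_args(
    app: str,
    datahub_options: Mapping[str, OptionValue],
    extra_conf: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Build a spark-submit argument list from unprefixed lineage options."""
    conf: Dict[str, str] = dict(extra_conf or {})
    for key, value in datahub_options.items():
        # Spark properties are strings; render booleans as true/false.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        conf[f"spark.datahub.{key}"] = text
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    args = spark_submit_args(
        "my_spark_app.py",
        {
            "rest.server": "http://localhost:8080",
            "coalesce_jobs": True,
            "metadata.dataset.env": "PROD",
            "tags": "etl,production,daily",
        },
        extra_conf={"spark.extraListeners": "datahub.spark.DatahubSparkListener"},
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(args))
```

Returning a list rather than a single string lets the result be passed directly to subprocess.run without shell quoting concerns; shlex.join is used only for display.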

Related

Implementation:Datahub_project_Datahub_SparkLineageConf_Builder
