
Implementation:Datahub project Datahub SparkConfigParser Configuration

From Leeroopedia


Attribute Value
Page Type Implementation (API Doc)
Workflow Spark_Lineage_Capture
Pair 2 of 6
Principle Principle:Datahub_project_Datahub_Spark_Listener_Configuration
Repository https://github.com/datahub-project/datahub
Source Location metadata-integration/java/acryl-spark-lineage/src/main/java/datahub/spark/conf/SparkConfigParser.java:L1-418
Last Updated 2026-02-09 17:00 GMT

Overview

Description

SparkConfigParser is the central configuration parsing class that reads Spark configuration properties prefixed with spark.datahub. and converts them into a strongly-typed DatahubOpenlineageConfig object. This class is a utility with only static methods (the constructor is private), providing a clean functional interface for configuration parsing.

The parser strips the spark.datahub. prefix from property keys, loads them into a Typesafe Config object, and then uses individual accessor methods to extract each configuration value with appropriate defaults. The primary entry point, sparkConfigToDatahubOpenlineageConf, orchestrates the full parsing pipeline and returns a builder-constructed configuration object.
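The prefix-stripping step can be illustrated with a short, self-contained sketch. This is a hypothetical re-implementation using only java.util.Properties (the real parser loads the stripped keys into a Typesafe Config tree afterwards); the class and method names here are illustrative, not the actual SparkConfigParser internals.

```java
import java.util.Properties;

public class PrefixStripSketch {
    static final String PREFIX = "spark.datahub.";

    // Copy every property whose key starts with "spark.datahub." into a new
    // Properties object with the prefix removed, so that
    // "spark.datahub.rest.server" becomes "rest.server". Non-DataHub Spark
    // keys are dropped.
    public static Properties stripPrefix(Properties in) {
        Properties out = new Properties();
        for (String key : in.stringPropertyNames()) {
            if (key.startsWith(PREFIX)) {
                out.setProperty(key.substring(PREFIX.length()), in.getProperty(key));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Properties spark = new Properties();
        spark.setProperty("spark.datahub.rest.server", "http://localhost:8080");
        spark.setProperty("spark.master", "local[*]"); // unrelated key, ignored
        Properties stripped = stripPrefix(spark);
        System.out.println(stripped.getProperty("rest.server")); // http://localhost:8080
        System.out.println(stripped.size());                     // 1
    }
}
```

The resulting root-level keys ("rest.server", "metadata.dataset.env", and so on) are what the individual accessor methods look up.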

Usage

SparkConfigParser is called during listener initialization in DatahubSparkListener.loadDatahubConfig(). It is not typically called directly by users but is essential to understanding how Spark properties map to agent behavior.

The parser handles the following configuration namespaces:

Spark Property Config Key Default Description
spark.datahub.rest.server rest.server http://localhost:8080 GMS server URL
spark.datahub.rest.token rest.token (none) Authentication token
spark.datahub.emitter emitter rest Transport type: rest, kafka, file, s3
spark.datahub.coalesce_jobs coalesce_jobs true Merge all jobs into one DataJob
spark.datahub.metadata.dataset.env metadata.dataset.env PROD FabricType for dataset URNs
spark.datahub.metadata.dataset.materialize metadata.dataset.materialize false Create dataset entities in DataHub
spark.datahub.metadata.dataset.include_schema_metadata metadata.dataset.include_schema_metadata false Emit schema metadata from Spark
spark.datahub.metadata.dataset.platformInstance metadata.dataset.platformInstance (none) Dataset-level platform instance
spark.datahub.metadata.pipeline.platformInstance metadata.pipeline.platformInstance (none) Pipeline-level platform instance
spark.datahub.captureColumnLevelLineage captureColumnLevelLineage true Capture column-level lineage
spark.datahub.disableSymlinkResolution disableSymlinkResolution false Prefer S3 paths over Hive table names
spark.datahub.patch.enabled patch.enabled false Append lineage edges instead of overwrite
spark.datahub.metadata.dataset.lowerCaseUrns metadata.dataset.lowerCaseUrns false Lowercase dataset URNs
spark.datahub.metadata.dataset.hivePlatformAlias metadata.dataset.hivePlatformAlias hive Platform alias for Hive tables
spark.datahub.stage_metadata_coalescing stage_metadata_coalescing false Emit coalesced data periodically
spark.datahub.streaming_heartbeat streaming_heartbeat 300 Streaming heartbeat interval (seconds)
spark.datahub.flow_name flow_name (app name) Override the DataFlow name
spark.datahub.tags tags (none) Comma-separated tags
spark.datahub.domains domains (none) Comma-separated domain URNs
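Each accessor in the table falls back to its documented default when the key is absent. The following sketch shows that lookup-with-default pattern in isolation; the helper names are illustrative, not the real accessor implementations.

```java
import java.util.Properties;

public class DefaultsSketch {
    // Return the parsed boolean value if the key is present, else the
    // documented default (e.g. coalesce_jobs -> true, patch.enabled -> false).
    public static boolean getBoolean(Properties conf, String key, boolean def) {
        String v = conf.getProperty(key);
        return v == null ? def : Boolean.parseBoolean(v);
    }

    // Same pattern for integer-valued settings such as streaming_heartbeat.
    public static int getInt(Properties conf, String key, int def) {
        String v = conf.getProperty(key);
        return v == null ? def : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // nothing set, so defaults apply
        System.out.println(getBoolean(conf, "coalesce_jobs", true));   // true
        System.out.println(getBoolean(conf, "patch.enabled", false));  // false
        System.out.println(getInt(conf, "streaming_heartbeat", 300));  // 300

        conf.setProperty("streaming_heartbeat", "60"); // explicit override wins
        System.out.println(getInt(conf, "streaming_heartbeat", 300));  // 60
    }
}
```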

Code Reference

Source Location

File metadata-integration/java/acryl-spark-lineage/src/main/java/datahub/spark/conf/SparkConfigParser.java
Lines L1-418
Module acryl-spark-lineage

Signature

Primary method:

public static DatahubOpenlineageConfig sparkConfigToDatahubOpenlineageConf(
    Config sparkConfig,
    SparkAppContext sparkAppContext)

Configuration parsing methods:

public static Config parseSparkConfig()
public static Config parsePropertiesToConfig(Properties properties)
public static Properties moveKeysToRoot(Properties properties, String prefix)

Individual property accessors:

public static FabricType getCommonFabricType(Config datahubConfig)
public static boolean isCoalesceEnabled(Config datahubConfig)
public static boolean isDatasetMaterialize(Config datahubConfig)
public static boolean isIncludeSchemaMetadata(Config datahubConfig)
public static boolean isCaptureColumnLevelLineage(Config datahubConfig)
public static boolean isDisableSymlinkResolution(Config datahubConfig)
public static boolean isPatchEnabled(Config datahubConfig)
public static boolean isLowerCaseDatasetUrns(Config datahubConfig)
public static String getPlatformInstance(Config pathSpecConfig)
public static String getCommonPlatformInstance(Config datahubConfig)
public static String getHivePlatformAlias(Config datahubConfig)
public static String getPipelineName(Config datahubConfig, SparkAppContext appContext)
public static String[] getTags(Config datahubConfig)
public static String[] getDomains(Config datahubConfig)
public static Map<String, List<PathSpec>> getPathSpecListMap(Config datahubConfig)
public static int getStreamingHeartbeatSec(Config datahubConfig)

Import

import datahub.spark.conf.SparkConfigParser;

I/O Contract

Direction Type Description
Input Config (Typesafe) Spark configuration with spark.datahub. prefix stripped, parsed into a Typesafe Config tree.
Input SparkAppContext Application context containing app name, app ID, attempt ID, Spark user, and start time.
Output DatahubOpenlineageConfig Strongly-typed configuration object with fields: fabricType (FabricType enum), platformInstance (String), commonDatasetPlatformInstance (String), hivePlatformAlias (String, default "hive"), pipelineName (String), includeSchemaMetadata (boolean), materializeDataset (boolean), captureColumnLevelLineage (boolean, default true), disableSymlinkResolution (boolean), lowerCaseDatasetUrns (boolean), usePatch (boolean), pathSpecs (Map of platform to PathSpec list), filePartitionRegexpPattern (String), parentJobUrn (DataJobUrn), isSpark (true).
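The output object is builder-constructed, as noted in the overview. The sketch below is a simplified, hypothetical mirror of that pattern; the field names follow the I/O contract above, but this is not the real DatahubOpenlineageConfig class, and the real fabricType is an enum rather than a String.

```java
public class LineageConfSketch {
    public final String fabricType; // FabricType enum in the real class
    public final boolean captureColumnLevelLineage;
    public final boolean materializeDataset;
    public final String hivePlatformAlias;

    private LineageConfSketch(Builder b) {
        this.fabricType = b.fabricType;
        this.captureColumnLevelLineage = b.captureColumnLevelLineage;
        this.materializeDataset = b.materializeDataset;
        this.hivePlatformAlias = b.hivePlatformAlias;
    }

    public static class Builder {
        // Defaults mirror the configuration table: PROD env, column-level
        // lineage on, dataset materialization off, "hive" platform alias.
        private String fabricType = "PROD";
        private boolean captureColumnLevelLineage = true;
        private boolean materializeDataset = false;
        private String hivePlatformAlias = "hive";

        public Builder fabricType(String v) { this.fabricType = v; return this; }
        public Builder captureColumnLevelLineage(boolean v) { this.captureColumnLevelLineage = v; return this; }
        public Builder materializeDataset(boolean v) { this.materializeDataset = v; return this; }
        public Builder hivePlatformAlias(String v) { this.hivePlatformAlias = v; return this; }
        public LineageConfSketch build() { return new LineageConfSketch(this); }
    }

    public static void main(String[] args) {
        // The parser only overrides fields whose keys are present in the
        // config; everything else keeps its builder default.
        LineageConfSketch conf = new Builder().fabricType("DEV").build();
        System.out.println(conf.fabricType);                // DEV
        System.out.println(conf.captureColumnLevelLineage); // true
        System.out.println(conf.hivePlatformAlias);         // hive
    }
}
```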

Usage Examples

Example 1: Parsing Spark environment config

// Called internally by DatahubSparkListener during initialization.
// parseSparkConfig() reads the spark.datahub.* properties from the active
// Spark environment, so it must run inside a live Spark application.
Config sparkConfig = SparkConfigParser.parseSparkConfig();
SparkAppContext appContext = new SparkAppContext();
appContext.setAppName("my-etl-job");

DatahubOpenlineageConfig config =
    SparkConfigParser.sparkConfigToDatahubOpenlineageConf(sparkConfig, appContext);

// config.getFabricType() -> FabricType.PROD (default)
// config.isCaptureColumnLevelLineage() -> true (default)
// config.isMaterializeDataset() -> false (default)

Example 2: Parsing from Properties object

Properties props = new Properties();
props.setProperty("spark.datahub.rest.server", "https://datahub.example.com/gms");
props.setProperty("spark.datahub.rest.token", "my-token");
props.setProperty("spark.datahub.metadata.dataset.env", "DEV");
props.setProperty("spark.datahub.coalesce_jobs", "true");

// Strip the spark.datahub prefix first: the accessors expect root-level keys
// such as metadata.dataset.env, not the fully qualified Spark property names.
Properties rootProps = SparkConfigParser.moveKeysToRoot(props, "spark.datahub");
Config config = SparkConfigParser.parsePropertiesToConfig(rootProps);
FabricType env = SparkConfigParser.getCommonFabricType(config);
// env -> FabricType.DEV

Related Pages

Principle:Datahub_project_Datahub_Spark_Listener_Configuration