Implementation:Datahub project Datahub SparkConfigParser Configuration
| Attribute | Value |
|---|---|
| Page Type | Implementation (API Doc) |
| Workflow | Spark_Lineage_Capture |
| Pair | 2 of 6 |
| Principle | Principle:Datahub_project_Datahub_Spark_Listener_Configuration |
| Repository | https://github.com/datahub-project/datahub |
| Source Location | metadata-integration/java/acryl-spark-lineage/src/main/java/datahub/spark/conf/SparkConfigParser.java:L1-418 |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
SparkConfigParser is the central configuration-parsing class: it reads Spark configuration properties prefixed with spark.datahub. and converts them into a strongly typed DatahubOpenlineageConfig object. The class is a static utility (its constructor is private), so all parsing is exposed through a clean functional interface of static methods.
The parser strips the spark.datahub. prefix from property keys, loads the result into a Typesafe Config object, and then uses individual accessor methods to extract each configuration value with an appropriate default. The primary entry point, sparkConfigToDatahubOpenlineageConf, orchestrates the full parsing pipeline and returns a builder-constructed configuration object.
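The prefix-stripping step can be sketched in isolation. The following is a minimal, hypothetical re-implementation of the idea behind moveKeysToRoot using only java.util.Properties; the actual method in SparkConfigParser may differ in details such as key filtering and logging.

```java
import java.util.Properties;

public class PrefixStripSketch {
    // Hypothetical sketch: copy every key that starts with `prefix` to the
    // root namespace with the prefix removed, dropping unrelated keys
    // (mirrors the idea behind SparkConfigParser.moveKeysToRoot).
    public static Properties moveKeysToRoot(Properties props, String prefix) {
        Properties out = new Properties();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                out.setProperty(key.substring(prefix.length()), props.getProperty(key));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Properties spark = new Properties();
        spark.setProperty("spark.datahub.rest.server", "http://localhost:8080");
        spark.setProperty("spark.executor.memory", "4g"); // unrelated key, dropped
        Properties datahub = moveKeysToRoot(spark, "spark.datahub.");
        System.out.println(datahub.getProperty("rest.server"));        // http://localhost:8080
        System.out.println(datahub.containsKey("spark.executor.memory")); // false
    }
}
```

The stripped Properties object is what subsequently gets loaded into the Typesafe Config tree that the accessor methods operate on.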
Usage
SparkConfigParser is called during listener initialization in DatahubSparkListener.loadDatahubConfig(). It is not typically called directly by users but is essential to understanding how Spark properties map to agent behavior.
The parser recognizes the following configuration properties:
| Spark Property | Config Key | Default | Description |
|---|---|---|---|
| spark.datahub.rest.server | rest.server | http://localhost:8080 | GMS server URL |
| spark.datahub.rest.token | rest.token | (none) | Authentication token |
| spark.datahub.emitter | emitter | rest | Transport type: rest, kafka, file, s3 |
| spark.datahub.coalesce_jobs | coalesce_jobs | true | Merge all jobs into one DataJob |
| spark.datahub.metadata.dataset.env | metadata.dataset.env | PROD | FabricType for dataset URNs |
| spark.datahub.metadata.dataset.materialize | metadata.dataset.materialize | false | Create dataset entities in DataHub |
| spark.datahub.metadata.dataset.include_schema_metadata | metadata.dataset.include_schema_metadata | false | Emit schema metadata from Spark |
| spark.datahub.metadata.dataset.platformInstance | metadata.dataset.platformInstance | (none) | Dataset-level platform instance |
| spark.datahub.metadata.pipeline.platformInstance | metadata.pipeline.platformInstance | (none) | Pipeline-level platform instance |
| spark.datahub.captureColumnLevelLineage | captureColumnLevelLineage | true | Capture column-level lineage |
| spark.datahub.disableSymlinkResolution | disableSymlinkResolution | false | Prefer S3 paths over Hive table names |
| spark.datahub.patch.enabled | patch.enabled | false | Append lineage edges instead of overwriting |
| spark.datahub.metadata.dataset.lowerCaseUrns | metadata.dataset.lowerCaseUrns | false | Lowercase dataset URNs |
| spark.datahub.metadata.dataset.hivePlatformAlias | metadata.dataset.hivePlatformAlias | hive | Platform alias for Hive tables |
| spark.datahub.stage_metadata_coalescing | stage_metadata_coalescing | false | Emit coalesced metadata periodically |
| spark.datahub.streaming_heartbeat | streaming_heartbeat | 300 | Streaming heartbeat interval (seconds) |
| spark.datahub.flow_name | flow_name | (app name) | Override the DataFlow name |
| spark.datahub.tags | tags | (none) | Comma-separated tags |
| spark.datahub.domains | domains | (none) | Comma-separated domain URNs |
Code Reference
Source Location
| Attribute | Value |
|---|---|
| File | metadata-integration/java/acryl-spark-lineage/src/main/java/datahub/spark/conf/SparkConfigParser.java |
| Lines | L1-418 |
| Module | acryl-spark-lineage |
Signature
Primary method:
public static DatahubOpenlineageConfig sparkConfigToDatahubOpenlineageConf(
    Config sparkConfig,
    SparkAppContext sparkAppContext)
Configuration parsing methods:
public static Config parseSparkConfig()
public static Config parsePropertiesToConfig(Properties properties)
public static Properties moveKeysToRoot(Properties properties, String prefix)
Individual property accessors:
public static FabricType getCommonFabricType(Config datahubConfig)
public static boolean isCoalesceEnabled(Config datahubConfig)
public static boolean isDatasetMaterialize(Config datahubConfig)
public static boolean isIncludeSchemaMetadata(Config datahubConfig)
public static boolean isCaptureColumnLevelLineage(Config datahubConfig)
public static boolean isDisableSymlinkResolution(Config datahubConfig)
public static boolean isPatchEnabled(Config datahubConfig)
public static boolean isLowerCaseDatasetUrns(Config datahubConfig)
public static String getPlatformInstance(Config pathSpecConfig)
public static String getCommonPlatformInstance(Config datahubConfig)
public static String getHivePlatformAlias(Config datahubConfig)
public static String getPipelineName(Config datahubConfig, SparkAppContext appContext)
public static String[] getTags(Config datahubConfig)
public static String[] getDomains(Config datahubConfig)
public static Map<String, List<PathSpec>> getPathSpecListMap(Config datahubConfig)
public static int getStreamingHeartbeatSec(Config datahubConfig)
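Each accessor applies the same check-then-default pattern against the parsed configuration tree. The real accessors operate on com.typesafe.config.Config (hasPath followed by a typed getter); the sketch below is a hypothetical stdlib-only illustration of that pattern using a plain Map.

```java
import java.util.Map;

public class DefaultLookupSketch {
    // Hypothetical sketch of the lookup-with-default pattern used by
    // accessors such as isCoalesceEnabled and isPatchEnabled: return the
    // configured value if the key is present, otherwise the documented default.
    public static boolean getBoolean(Map<String, String> conf, String key, boolean defaultValue) {
        return conf.containsKey(key) ? Boolean.parseBoolean(conf.get(key)) : defaultValue;
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of("patch.enabled", "true");
        System.out.println(getBoolean(conf, "patch.enabled", false)); // true (explicitly set)
        System.out.println(getBoolean(conf, "coalesce_jobs", true));  // true (default applies)
    }
}
```

This is why every entry in the property table above carries a default: an absent key never fails parsing, it simply resolves to the fallback value.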
Import
import datahub.spark.conf.SparkConfigParser;
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | Config (Typesafe) | Spark configuration with the spark.datahub. prefix stripped, parsed into a Typesafe Config tree. |
| Input | SparkAppContext | Application context containing the app name, app ID, attempt ID, Spark user, and start time. |
| Output | DatahubOpenlineageConfig | Strongly typed configuration object with fields: fabricType (FabricType enum), platformInstance (String), commonDatasetPlatformInstance (String), hivePlatformAlias (String, default "hive"), pipelineName (String), includeSchemaMetadata (boolean), materializeDataset (boolean), captureColumnLevelLineage (boolean, default true), disableSymlinkResolution (boolean), lowerCaseDatasetUrns (boolean), usePatch (boolean), pathSpecs (Map of platform to PathSpec list), filePartitionRegexpPattern (String), parentJobUrn (DataJobUrn), isSpark (true). |
Usage Examples
Example 1: Parsing Spark environment config
// Called internally by DatahubSparkListener during initialization
Config sparkConfig = SparkConfigParser.parseSparkConfig();
SparkAppContext appContext = new SparkAppContext();
appContext.setAppName("my-etl-job");
DatahubOpenlineageConfig config =
SparkConfigParser.sparkConfigToDatahubOpenlineageConf(sparkConfig, appContext);
// config.getFabricType() -> FabricType.PROD (default)
// config.isCaptureColumnLevelLineage() -> true (default)
// config.isMaterializeDataset() -> false (default)
Example 2: Parsing from Properties object
Properties props = new Properties();
props.setProperty("spark.datahub.rest.server", "https://datahub.example.com/gms");
props.setProperty("spark.datahub.rest.token", "my-token");
props.setProperty("spark.datahub.metadata.dataset.env", "DEV");
props.setProperty("spark.datahub.coalesce_jobs", "true");
Config config = SparkConfigParser.parsePropertiesToConfig(props);
FabricType env = SparkConfigParser.getCommonFabricType(config);
// env -> FabricType.DEV