Principle:Datahub project Datahub Spark Connection Configuration

Metadata

Field	Value
principle_name	Spark Connection Configuration
description	The process of configuring the DataHub connection parameters within Spark's configuration namespace for lineage emission
type	Principle
status	Active
last_updated	2026-02-10
domains	Data_Lineage, Apache_Spark, Metadata_Management
repository	datahub-project/datahub

Overview

Spark Connection Configuration covers how DataHub connection parameters are supplied through Spark's configuration namespace so that the lineage agent can emit metadata. The configuration system extracts spark.datahub.* properties from SparkConf and converts them to a Typesafe Config (HOCON) object. It supports four emitter types (rest, kafka, file, s3), each with type-specific connection parameters.

Description

The DataHub Spark lineage agent reuses Spark's native configuration system to carry all DataHub-specific settings. Properties are specified under the spark.datahub.* prefix, which the SparkConfigParser class strips and parses into a Typesafe Config object.
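The prefix handling described above can be illustrated with a minimal Python sketch. It is not the SparkConfigParser implementation (which is Java and produces a Typesafe Config object); the function name and the plain-dict input are assumptions made for illustration only.

```python
# Illustrative sketch only: mimics the prefix stripping described above
# using plain dicts. The function name is hypothetical.
PREFIX = "spark.datahub."


def extract_datahub_properties(spark_properties):
    """Keep only spark.datahub.* entries and strip the prefix from each key."""
    return {
        key[len(PREFIX):]: value
        for key, value in spark_properties.items()
        if key.startswith(PREFIX)
    }


spark_properties = {
    "spark.app.name": "my_app",  # unrelated Spark setting, ignored
    "spark.datahub.rest.server": "https://datahub.example.com:8080",
    "spark.datahub.rest.max_retries": "3",
}
datahub_properties = extract_datahub_properties(spark_properties)
print(datahub_properties)
```

After stripping, the keys match the "Config Key" column used in the emitter tables (for example, rest.server).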

This design eliminates the need for DataHub-specific configuration files, environment variables, or external configuration services. All settings travel with the Spark application's configuration, whether they are specified via command-line --conf flags, in spark-defaults.conf, or through programmatic SparkConf API calls.
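Because the settings are ordinary Spark properties, the same key/value pairs can be rendered for any of these channels. The sketch below is a hypothetical helper, not part of DataHub or Spark; it assumes only that spark-submit accepts repeated --conf key=value arguments and that spark-defaults.conf holds one whitespace-separated key and value per line.

```python
import shlex


def to_submit_flags(settings):
    """Render settings as an argument list for spark-submit (--conf key=value)."""
    flags = []
    for key, value in settings.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


def to_defaults_conf(settings):
    """Render settings as spark-defaults.conf lines: key, whitespace, value."""
    return "\n".join(f"{key} {value}" for key, value in settings.items())


settings = {
    "spark.datahub.rest.server": "https://datahub.example.com:8080",
    "spark.datahub.rest.max_retries": "3",
}

# Command-line form
print(shlex.join(["spark-submit", *to_submit_flags(settings), "my_app.py"]))
# spark-defaults.conf form
print(to_defaults_conf(settings))
```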

The configuration supports four distinct emitter types, each with its own parameter set:

REST Emitter (Default)

Sends Metadata Change Proposals (MCPs) to the REST API of the DataHub GMS (Generalized Metadata Service):

Property	Config Key	Default	Description
`spark.datahub.rest.server`	`rest.server`	`http://localhost:8080`	GMS server URL
`spark.datahub.rest.token`	`rest.token`	null	Authentication token
`spark.datahub.rest.disable_ssl_verification`	`rest.disable_ssl_verification`	`false`	Disable SSL certificate verification
`spark.datahub.rest.disable_chunked_encoding`	`rest.disable_chunked_encoding`	`false`	Disable HTTP chunked transfer encoding
`spark.datahub.rest.max_retries`	`rest.max_retries`	`0`	Maximum number of retry attempts
`spark.datahub.rest.retry_interval_in_sec`	`rest.retry_interval_in_sec`	`5`	Seconds between retries
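To make the defaults in the table concrete, here is a small sketch that resolves REST settings from a flat dictionary of prefix-stripped keys. The RestSettings class and rest_settings_from function are hypothetical names for this example; only the keys and default values come from the table above.

```python
from dataclasses import dataclass
from typing import Optional


def _to_bool(value):
    """Interpret 'true'/'false' strings (or real booleans) as a bool."""
    return str(value).strip().lower() == "true"


@dataclass
class RestSettings:
    # Defaults mirror the table above.
    server: str = "http://localhost:8080"
    token: Optional[str] = None
    disable_ssl_verification: bool = False
    disable_chunked_encoding: bool = False
    max_retries: int = 0
    retry_interval_in_sec: int = 5


def rest_settings_from(config):
    """Build RestSettings from a flat dict such as {'rest.server': '...'}."""
    d = RestSettings()
    return RestSettings(
        server=config.get("rest.server", d.server),
        token=config.get("rest.token", d.token),
        disable_ssl_verification=_to_bool(
            config.get("rest.disable_ssl_verification", d.disable_ssl_verification)
        ),
        disable_chunked_encoding=_to_bool(
            config.get("rest.disable_chunked_encoding", d.disable_chunked_encoding)
        ),
        max_retries=int(config.get("rest.max_retries", d.max_retries)),
        retry_interval_in_sec=int(
            config.get("rest.retry_interval_in_sec", d.retry_interval_in_sec)
        ),
    )


print(rest_settings_from({"rest.max_retries": "3"}))
```

Spark property values arrive as strings, which is why the sketch converts booleans and integers explicitly.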

Kafka Emitter

Sends MCPs to a Kafka topic for asynchronous ingestion:

Property	Config Key	Description
`spark.datahub.kafka.bootstrap`	`kafka.bootstrap`	Kafka bootstrap servers
`spark.datahub.kafka.schema_registry_url`	`kafka.schema_registry_url`	Schema Registry URL
`spark.datahub.kafka.mcp_topic`	`kafka.mcp_topic`	Target MCP topic name
`spark.datahub.kafka.schema_registry_config.*`	`kafka.schema_registry_config.*`	Schema Registry configuration properties
`spark.datahub.kafka.producer_config.*`	`kafka.producer_config.*`	Kafka producer configuration properties
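The two wildcard entries carry arbitrary pass-through properties. As a rough sketch, and assuming that everything after the wildcard prefix is handed to the client unchanged (the source does not spell this out), the sub-properties could be collected as follows. The sub_config helper and the example values are hypothetical; client.id, security.protocol, and basic.auth.credentials.source are used here only as familiar Kafka and Schema Registry client property names.

```python
def sub_config(config, prefix):
    """Return the entries under `prefix.` with that prefix removed."""
    marker = prefix + "."
    return {
        key[len(marker):]: value
        for key, value in config.items()
        if key.startswith(marker)
    }


# Flat, prefix-stripped configuration (values are illustrative).
config = {
    "kafka.bootstrap": "kafka:9092",
    "kafka.producer_config.client.id": "spark-lineage",
    "kafka.producer_config.security.protocol": "SSL",
    "kafka.schema_registry_config.basic.auth.credentials.source": "USER_INFO",
}

producer_overrides = sub_config(config, "kafka.producer_config")
registry_overrides = sub_config(config, "kafka.schema_registry_config")
print(producer_overrides)
print(registry_overrides)
```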

File Emitter

Writes MCPs to a local file:

Property	Config Key	Description
`spark.datahub.file.filename`	`file.filename`	Output file path

S3 Emitter

Writes MCPs to Amazon S3:

Property	Config Key	Description
`spark.datahub.s3.bucket`	`s3.bucket`	S3 bucket name
`spark.datahub.s3.prefix`	`s3.prefix`	Object key prefix
`spark.datahub.s3.region`	`s3.region`	AWS region
`spark.datahub.s3.profile`	`s3.profile`	AWS credential profile name
`spark.datahub.s3.endpoint`	`s3.endpoint`	Custom S3 endpoint URL
`spark.datahub.s3.access_key`	`s3.access_key`	AWS access key
`spark.datahub.s3.secret_key`	`s3.secret_key`	AWS secret key
`spark.datahub.s3.filename`	`s3.filename`	Output file name within S3

Theoretical Basis

This principle follows the namespace-scoped configuration pattern, where Spark's existing configuration system is leveraged to carry DataHub-specific settings under the spark.datahub.* prefix. This avoids separate config files and ensures that configuration travels with the application.

The configuration is parsed using Typesafe Config (HOCON format), which provides:

Hierarchical structure: Dotted keys are parsed into a tree structure
Type safety: Values are parsed as their expected types (string, boolean, int)
Defaults: Missing values fall back to defaults (for example, rest.server falls back to http://localhost:8080)
Overriding: Properties can be overridden at any level (command line, config file, code)
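The hierarchical structure and default handling listed above can be sketched in plain Python. This stands in for what Typesafe Config provides and does not use that library; to_tree and get_int are hypothetical helpers, and the sketch assumes no key is both a leaf and a parent.

```python
def to_tree(flat):
    """Turn dotted keys into nested dicts, e.g. 'rest.server' -> {'rest': {'server': ...}}."""
    tree = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree


def get_int(tree, path, default):
    """Look up a dotted path and parse it as an int, or return the default if missing."""
    node = tree
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return int(node)


tree = to_tree({
    "emitter": "rest",
    "rest.server": "https://datahub.example.com:8080",
    "rest.max_retries": "3",
})
print(tree)
print(get_int(tree, "rest.max_retries", 0))
print(get_int(tree, "rest.retry_interval_in_sec", 5))
```

The second lookup is for a key that was never set, so it returns the supplied default instead of raising an error.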

Emitter selection follows the Strategy pattern: the emitter configuration key (set as spark.datahub.emitter, defaulting to rest) selects which emitter implementation to instantiate.
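A minimal sketch of this kind of Strategy dispatch is shown below. The factory functions are placeholders that return descriptive strings rather than real emitters, and all names are hypothetical; only the four emitter types, the emitter key, and the REST default come from this page.

```python
# Placeholder "factories": each returns a short description instead of a real emitter.
def make_rest(config):
    return "rest:" + config.get("rest.server", "http://localhost:8080")


def make_kafka(config):
    return "kafka:" + str(config.get("kafka.bootstrap"))


def make_file(config):
    return "file:" + str(config.get("file.filename"))


def make_s3(config):
    return "s3:" + str(config.get("s3.bucket"))


EMITTER_FACTORIES = {
    "rest": make_rest,
    "kafka": make_kafka,
    "file": make_file,
    "s3": make_s3,
}


def select_emitter(config):
    """Pick a factory by the 'emitter' key; REST is used when the key is absent."""
    emitter_type = config.get("emitter", "rest")
    try:
        factory = EMITTER_FACTORIES[emitter_type]
    except KeyError:
        raise ValueError(f"Unsupported emitter type: {emitter_type!r}") from None
    return factory(config)


print(select_emitter({}))
print(select_emitter({"emitter": "file", "file.filename": "/tmp/datahub_mcps.json"}))
```

Keeping the mapping in one table means that adding a further emitter type would only require registering another factory, which is the main appeal of the pattern.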

Usage

This principle applies when configuring how the Spark lineage agent connects to DataHub for metadata emission.

# REST emitter (default)
spark-submit \
  --conf "spark.datahub.rest.server=https://datahub.example.com:8080" \
  --conf "spark.datahub.rest.token=eyJ..." \
  --conf "spark.datahub.rest.max_retries=3" \
  my_app.py

# Kafka emitter
spark-submit \
  --conf "spark.datahub.emitter=kafka" \
  --conf "spark.datahub.kafka.bootstrap=kafka:9092" \
  --conf "spark.datahub.kafka.schema_registry_url=http://schema-registry:8081" \
  my_app.py

# File emitter
spark-submit \
  --conf "spark.datahub.emitter=file" \
  --conf "spark.datahub.file.filename=/tmp/datahub_mcps.json" \
  my_app.py

# S3 emitter
spark-submit \
  --conf "spark.datahub.emitter=s3" \
  --conf "spark.datahub.s3.bucket=my-datahub-bucket" \
  --conf "spark.datahub.s3.prefix=lineage/" \
  --conf "spark.datahub.s3.region=us-east-1" \
  my_app.py
