Principle:Datahub project Datahub Spark Connection Configuration
Metadata
| Field | Value |
|---|---|
| principle_name | Spark Connection Configuration |
| description | The process of configuring the DataHub connection parameters within Spark's configuration namespace for lineage emission |
| type | Principle |
| status | Active |
| last_updated | 2026-02-10 |
| domains | Data_Lineage, Apache_Spark, Metadata_Management |
| repository | datahub-project/datahub |
Overview
Spark Connection Configuration is the process of setting DataHub connection parameters inside Spark's own configuration namespace so that the lineage agent can emit metadata. The configuration system extracts spark.datahub.* properties from SparkConf and converts them to a Typesafe Config object. It supports four emitter types (rest, kafka, file, s3), each with type-specific connection parameters.
Description
The DataHub Spark lineage agent reuses Spark's native configuration system to carry all DataHub-specific settings. Properties are specified under the spark.datahub.* prefix; the SparkConfigParser class collects them, strips the prefix, and parses the remaining keys into a Typesafe Config object.
This design eliminates the need for separate configuration files, environment variables, or external configuration services. All settings travel with the Spark application's configuration, whether specified via command-line --conf flags, spark-defaults.conf, or programmatic SparkConf API calls.
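The prefix handling described above can be sketched in a few lines of plain Python. This is an illustration only, not the SparkConfigParser code: the function name extract_datahub_properties and the sample property values are assumptions, and a dict stands in for SparkConf.

```python
from typing import Dict, Mapping

DATAHUB_PREFIX = "spark.datahub."


def extract_datahub_properties(spark_properties: Mapping[str, str]) -> Dict[str, str]:
    """Keep only spark.datahub.* entries and drop the prefix from their keys."""
    return {
        key[len(DATAHUB_PREFIX):]: value
        for key, value in spark_properties.items()
        if key.startswith(DATAHUB_PREFIX)
    }


# A dict stands in for SparkConf here (assumption for illustration).
spark_properties = {
    "spark.app.name": "my_app",
    "spark.datahub.emitter": "rest",
    "spark.datahub.rest.server": "https://datahub.example.com:8080",
}

datahub_properties = extract_datahub_properties(spark_properties)
```

Unrelated Spark settings such as spark.app.name are ignored, and the remaining keys match the Config Key column used in the tables of this page.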
The configuration supports four distinct emitter types, each with its own parameter set:
REST Emitter (Default)
Sends Metadata Change Proposals (MCPs) to the DataHub GMS REST API:
| Property | Config Key | Default | Description |
|---|---|---|---|
| spark.datahub.rest.server | rest.server | http://localhost:8080 | GMS server URL |
| spark.datahub.rest.token | rest.token | null | Authentication token |
| spark.datahub.rest.disable_ssl_verification | rest.disable_ssl_verification | false | Disable SSL certificate verification |
| spark.datahub.rest.disable_chunked_encoding | rest.disable_chunked_encoding | false | Disable HTTP chunked transfer encoding |
| spark.datahub.rest.max_retries | rest.max_retries | 0 | Maximum number of retry attempts |
| spark.datahub.rest.retry_interval_in_sec | rest.retry_interval_in_sec | 5 | Seconds between retries |
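As a rough sketch of how the REST defaults could be applied, the following Python resolves prefix-stripped keys into a typed settings object. The RestSettings class and the helper names are hypothetical; only the keys and default values come from the REST emitter table. The boolean parsing rule (only the string "true" is truthy) is also an assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RestSettings:
    """Hypothetical holder for the REST emitter settings and their documented defaults."""
    server: str = "http://localhost:8080"
    token: Optional[str] = None
    disable_ssl_verification: bool = False
    disable_chunked_encoding: bool = False
    max_retries: int = 0
    retry_interval_in_sec: int = 5


def parse_bool(text: str) -> bool:
    # Spark stores every value as a string, so booleans need explicit parsing.
    return text.strip().lower() == "true"


def rest_settings_from(props: Mapping[str, str]) -> RestSettings:
    """Build RestSettings from prefix-stripped keys such as 'rest.server'."""
    defaults = RestSettings()
    return RestSettings(
        server=props.get("rest.server", defaults.server),
        token=props.get("rest.token", defaults.token),
        disable_ssl_verification=parse_bool(
            props.get("rest.disable_ssl_verification", "false")),
        disable_chunked_encoding=parse_bool(
            props.get("rest.disable_chunked_encoding", "false")),
        max_retries=int(props.get("rest.max_retries", defaults.max_retries)),
        retry_interval_in_sec=int(
            props.get("rest.retry_interval_in_sec", defaults.retry_interval_in_sec)),
    )


settings = rest_settings_from({
    "rest.server": "https://datahub.example.com:8080",
    "rest.max_retries": "3",
})
```

Keys that are not supplied keep the defaults from the table, so settings.token stays None and settings.retry_interval_in_sec stays 5.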
Kafka Emitter
Sends MCPs to a Kafka topic for asynchronous ingestion:
| Property | Config Key | Description |
|---|---|---|
| spark.datahub.kafka.bootstrap | kafka.bootstrap | Kafka bootstrap servers |
| spark.datahub.kafka.schema_registry_url | kafka.schema_registry_url | Schema Registry URL |
| spark.datahub.kafka.mcp_topic | kafka.mcp_topic | Target MCP topic name |
| spark.datahub.kafka.schema_registry_config.* | kafka.schema_registry_config.* | Schema Registry configuration properties |
| spark.datahub.kafka.producer_config.* | kafka.producer_config.* | Kafka producer configuration properties |
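The two wildcard entries group arbitrary client settings under a common prefix. A minimal sketch of collecting such a group is shown below. It assumes the keys after the prefix are handed to the Kafka client unchanged, which the table implies but does not state; sub_config is a hypothetical helper, and the client.id and retries values are sample data.

```python
from typing import Dict, Mapping


def sub_config(props: Mapping[str, str], prefix: str) -> Dict[str, str]:
    """Collect every entry under '<prefix>.' and strip that prefix from the key."""
    marker = prefix + "."
    return {
        key[len(marker):]: value
        for key, value in props.items()
        if key.startswith(marker)
    }


# Prefix-stripped DataHub properties; the producer settings are sample values.
props = {
    "kafka.bootstrap": "kafka:9092",
    "kafka.schema_registry_url": "http://schema-registry:8081",
    "kafka.producer_config.client.id": "spark-lineage",
    "kafka.producer_config.retries": "3",
}

producer_config = sub_config(props, "kafka.producer_config")
```

Note that a native Kafka key such as client.id keeps its own dot after the DataHub prefix is removed.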
File Emitter
Writes MCPs to a local file:
| Property | Config Key | Description |
|---|---|---|
| spark.datahub.file.filename | file.filename | Output file path |
S3 Emitter
Writes MCPs to Amazon S3:
| Property | Config Key | Description |
|---|---|---|
| spark.datahub.s3.bucket | s3.bucket | S3 bucket name |
| spark.datahub.s3.prefix | s3.prefix | Object key prefix |
| spark.datahub.s3.region | s3.region | AWS region |
| spark.datahub.s3.profile | s3.profile | AWS credential profile name |
| spark.datahub.s3.endpoint | s3.endpoint | Custom S3 endpoint URL |
| spark.datahub.s3.access_key | s3.access_key | AWS access key |
| spark.datahub.s3.secret_key | s3.secret_key | AWS secret key |
| spark.datahub.s3.filename | s3.filename | Output file name within S3 |
Theoretical Basis
This principle follows the namespace-scoped configuration pattern, where Spark's existing configuration system is leveraged to carry DataHub-specific settings under the spark.datahub.* prefix. This avoids separate config files and ensures that configuration travels with the application.
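The extracted properties are loaded with the Typesafe Config library (the reference implementation of the HOCON format), which provides:
- Hierarchical structure: Dotted keys such as rest.server are parsed into a tree, so each emitter's settings form their own subtree
- Typed access: Values arrive from Spark as strings and are read through typed accessors (string, boolean, int)
- Defaults: Optional keys that are absent fall back to documented defaults, such as http://localhost:8080 for rest.server
- Overriding: The same property can be set in spark-defaults.conf, on the command line, or in code; Spark resolves the precedence (code over command-line flags over spark-defaults.conf) before the parser sees the value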
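The properties listed above can be illustrated without the Typesafe Config library. The sketch below uses plain Python dictionaries to mimic the behavior; to_tree and get_int are hypothetical helpers, and the three layers are sample data. The layering order reflects Spark's documented precedence, in which values set in code win over command-line flags, which win over spark-defaults.conf.

```python
from typing import Any, Dict, Mapping


def to_tree(flat: Mapping[str, str]) -> Dict[str, Any]:
    """Turn dotted keys ('rest.server') into nested dictionaries."""
    tree: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree


def get_int(tree: Mapping[str, Any], path: str, default: int) -> int:
    """Read an integer at a dotted path, falling back to a default if absent."""
    node: Any = tree
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return int(node)


# Later layers override earlier ones, mirroring Spark's precedence.
spark_defaults_conf = {"rest.server": "http://localhost:8080", "rest.max_retries": "1"}
command_line = {"rest.max_retries": "3"}
set_in_code = {"rest.server": "https://datahub.example.com:8080"}

merged = {**spark_defaults_conf, **command_line, **set_in_code}
tree = to_tree(merged)

max_retries = get_int(tree, "rest.max_retries", default=0)
retry_interval = get_int(tree, "rest.retry_interval_in_sec", default=5)
```

Here max_retries comes from the command-line layer, the server URL from the code layer, and retry_interval from the fallback default because no layer sets it.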
The emitter type selection follows the Strategy pattern: the value of spark.datahub.emitter (rest when the property is not set) determines which emitter implementation is instantiated.
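A minimal sketch of this selection, assuming a registry keyed by emitter name, might look as follows. The factory functions are placeholders that return descriptive strings rather than real emitters, and none of the names below come from the DataHub code base.

```python
from typing import Callable, Dict, Mapping

Props = Mapping[str, str]


# Placeholder factories: each returns a description instead of a real emitter.
def make_rest_emitter(props: Props) -> str:
    return f"RestEmitter(server={props.get('rest.server', 'http://localhost:8080')})"


def make_kafka_emitter(props: Props) -> str:
    return f"KafkaEmitter(bootstrap={props.get('kafka.bootstrap')})"


def make_file_emitter(props: Props) -> str:
    return f"FileEmitter(filename={props.get('file.filename')})"


def make_s3_emitter(props: Props) -> str:
    return f"S3Emitter(bucket={props.get('s3.bucket')})"


EMITTER_FACTORIES: Dict[str, Callable[[Props], str]] = {
    "rest": make_rest_emitter,
    "kafka": make_kafka_emitter,
    "file": make_file_emitter,
    "s3": make_s3_emitter,
}


def create_emitter(props: Props) -> str:
    """Pick the strategy named by the 'emitter' key, defaulting to 'rest'."""
    emitter_type = props.get("emitter", "rest")
    if emitter_type not in EMITTER_FACTORIES:
        raise ValueError(f"Unsupported emitter type: {emitter_type}")
    return EMITTER_FACTORIES[emitter_type](props)


default_emitter = create_emitter({})
file_emitter = create_emitter({"emitter": "file", "file.filename": "/tmp/datahub_mcps.json"})
```

Adding a fifth emitter type in this sketch would mean registering one more factory, without touching create_emitter.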
Usage
This principle applies when configuring how the Spark lineage agent connects to DataHub for metadata emission.
```shell
# REST emitter (default)
spark-submit \
  --conf "spark.datahub.rest.server=https://datahub.example.com:8080" \
  --conf "spark.datahub.rest.token=eyJ..." \
  --conf "spark.datahub.rest.max_retries=3" \
  my_app.py

# Kafka emitter
spark-submit \
  --conf "spark.datahub.emitter=kafka" \
  --conf "spark.datahub.kafka.bootstrap=kafka:9092" \
  --conf "spark.datahub.kafka.schema_registry_url=http://schema-registry:8081" \
  my_app.py

# File emitter
spark-submit \
  --conf "spark.datahub.emitter=file" \
  --conf "spark.datahub.file.filename=/tmp/datahub_mcps.json" \
  my_app.py

# S3 emitter
spark-submit \
  --conf "spark.datahub.emitter=s3" \
  --conf "spark.datahub.s3.bucket=my-datahub-bucket" \
  --conf "spark.datahub.s3.prefix=lineage/" \
  --conf "spark.datahub.s3.region=us-east-1" \
  my_app.py
```
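When jobs are launched from a script, the same --conf flags can be generated from a dictionary instead of being typed by hand. The helper below is a hypothetical convenience, not part of DataHub or Spark; it only assembles the argument list and does not run spark-submit.

```python
import shlex
from typing import List, Mapping


def spark_submit_command(app: str, datahub_settings: Mapping[str, str]) -> List[str]:
    """Build a spark-submit argument list with one --conf flag per DataHub setting."""
    args = ["spark-submit"]
    for key, value in datahub_settings.items():
        # Re-attach the spark.datahub. prefix expected by the lineage agent.
        args += ["--conf", f"spark.datahub.{key}={value}"]
    args.append(app)
    return args


command = spark_submit_command(
    "my_app.py",
    {"emitter": "file", "file.filename": "/tmp/datahub_mcps.json"},
)

# Print a shell-quoted form for logging or copy-paste.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run, which avoids shell quoting problems with values such as tokens.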
Knowledge Sources
- DataHub GitHub Repository
- OpenLineage Documentation
- Apache Spark Configuration Documentation
- Typesafe Config (HOCON) Library
Related
- Implemented by: Datahub_project_Datahub_SparkConfigParser_ParseSparkConfig