Principle:Datahub project Datahub Spark Connection Configuration



Metadata

| Field | Value |
|---|---|
| principle_name | Spark Connection Configuration |
| description | The process of configuring the DataHub connection parameters within Spark's configuration namespace for lineage emission |
| type | Principle |
| status | Active |
| last_updated | 2026-02-10 |
| domains | Data_Lineage, Apache_Spark, Metadata_Management |
| repository | datahub-project/datahub |

Overview

Spark Connection Configuration defines how DataHub connection parameters are supplied through Spark's own configuration namespace so that the lineage agent can emit metadata. The configuration system extracts the spark.datahub.* properties from SparkConf and converts them into a Typesafe Config (HOCON) object. It supports four emitter types (rest, kafka, file, s3), each with its own connection parameters.

Description

The DataHub Spark lineage agent reuses Spark's native configuration system to carry all DataHub-specific settings. Properties are specified under the spark.datahub.* prefix. The SparkConfigParser class strips that prefix and parses the remaining keys into a Typesafe Config object, so spark.datahub.rest.server becomes the config key rest.server.
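
The prefix handling described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the SparkConfigParser code: the function name `parse_datahub_conf` and the use of a nested dictionary in place of a Typesafe Config object are assumptions made for the example.

```python
from typing import Any, Dict, Mapping

PREFIX = "spark.datahub."  # namespace named in the text above


def parse_datahub_conf(spark_conf: Mapping[str, str]) -> Dict[str, Any]:
    """Keep only spark.datahub.* entries, strip the prefix, nest by dots.

    Hypothetical helper: a nested dict stands in for the Typesafe Config tree.
    """
    tree: Dict[str, Any] = {}
    for key, value in spark_conf.items():
        if not key.startswith(PREFIX):
            continue  # unrelated Spark setting, ignored
        parts = key[len(PREFIX):].split(".")
        node = tree
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # descend, creating branches
        node[parts[-1]] = value  # values stay strings at this stage
    return tree


example = parse_datahub_conf({
    "spark.app.name": "my_app",
    "spark.datahub.rest.server": "https://datahub.example.com:8080",
    "spark.datahub.rest.max_retries": "3",
})
print(example)
```

Only the two spark.datahub.* entries survive, grouped under a single `rest` branch, which is what lets an emitter read its own settings as one unit.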

This design eliminates the need for separate configuration files, environment variables, or external configuration services. All settings travel with the Spark application's configuration, whether specified via command-line --conf flags, spark-defaults.conf, or programmatic SparkConf API calls.

The configuration supports four distinct emitter types, each with its own parameter set:

REST Emitter (Default)

Sends Metadata Change Proposals (MCPs) to the DataHub GMS REST API:

| Property | Config Key | Default | Description |
|---|---|---|---|
| spark.datahub.rest.server | rest.server | http://localhost:8080 | GMS server URL |
| spark.datahub.rest.token | rest.token | null | Authentication token |
| spark.datahub.rest.disable_ssl_verification | rest.disable_ssl_verification | false | Disable SSL certificate verification |
| spark.datahub.rest.disable_chunked_encoding | rest.disable_chunked_encoding | false | Disable HTTP chunked transfer encoding |
| spark.datahub.rest.max_retries | rest.max_retries | 0 | Maximum number of retry attempts |
| spark.datahub.rest.retry_interval_in_sec | rest.retry_interval_in_sec | 5 | Seconds between retries |
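
To make the defaults in the table concrete, the following sketch reads prefix-stripped REST keys and falls back to the listed defaults when a key is absent. The `RestSettings` class and `rest_settings` function are hypothetical names for illustration; the real agent reads these values through Typesafe Config, not through this code.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RestSettings:
    # Defaults copied from the REST emitter table above.
    server: str = "http://localhost:8080"
    token: Optional[str] = None
    disable_ssl_verification: bool = False
    disable_chunked_encoding: bool = False
    max_retries: int = 0
    retry_interval_in_sec: int = 5


def rest_settings(conf: Mapping[str, str]) -> RestSettings:
    """Build settings from prefix-stripped keys such as 'rest.server'."""
    defaults = RestSettings()

    def flag(key: str, default: bool) -> bool:
        # Spark passes every value as a string; treat "true" as True.
        return conf[key].strip().lower() == "true" if key in conf else default

    def number(key: str, default: int) -> int:
        return int(conf[key]) if key in conf else default

    return RestSettings(
        server=conf.get("rest.server", defaults.server),
        token=conf.get("rest.token"),
        disable_ssl_verification=flag(
            "rest.disable_ssl_verification", defaults.disable_ssl_verification),
        disable_chunked_encoding=flag(
            "rest.disable_chunked_encoding", defaults.disable_chunked_encoding),
        max_retries=number("rest.max_retries", defaults.max_retries),
        retry_interval_in_sec=number(
            "rest.retry_interval_in_sec", defaults.retry_interval_in_sec),
    )


print(rest_settings({"rest.max_retries": "3"}))
```

With the default max_retries of 0, a failed request is not retried, so retry_interval_in_sec only matters once max_retries is raised.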

Kafka Emitter

Sends MCPs to a Kafka topic, from which DataHub ingests them asynchronously:

| Property | Config Key | Description |
|---|---|---|
| spark.datahub.kafka.bootstrap | kafka.bootstrap | Kafka bootstrap servers |
| spark.datahub.kafka.schema_registry_url | kafka.schema_registry_url | Schema Registry URL |
| spark.datahub.kafka.mcp_topic | kafka.mcp_topic | Target MCP topic name |
| spark.datahub.kafka.schema_registry_config.* | kafka.schema_registry_config.* | Schema Registry configuration properties |
| spark.datahub.kafka.producer_config.* | kafka.producer_config.* | Kafka producer configuration properties |
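
The two wildcard rows accept arbitrary sub-keys. One plausible reading, sketched below, is that everything after the wildcard prefix is passed through as a native Kafka property. The helper `collect_subkeys` and the sample property values are assumptions for illustration, and the exact pass-through behaviour should be checked against the parser.

```python
from typing import Dict, Mapping


def collect_subkeys(conf: Mapping[str, str], prefix: str) -> Dict[str, str]:
    """Return every entry under `prefix`, keyed by the remainder of its name."""
    prefix = prefix.rstrip(".") + "."
    return {
        key[len(prefix):]: value
        for key, value in conf.items()
        if key.startswith(prefix)
    }


conf = {
    "spark.datahub.kafka.bootstrap": "kafka:9092",
    # Assumed sample values; the remainder is an ordinary Kafka producer key.
    "spark.datahub.kafka.producer_config.client.id": "my_app",
    "spark.datahub.kafka.producer_config.acks": "all",
}

producer = collect_subkeys(conf, "spark.datahub.kafka.producer_config")
print(producer)
```

Note that the remainder keeps its own dots (`client.id`), so it must be treated as a flat property name and not split into further levels.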

File Emitter

Writes MCPs to a local file:

| Property | Config Key | Description |
|---|---|---|
| spark.datahub.file.filename | file.filename | Output file path |

S3 Emitter

Writes MCPs to Amazon S3:

| Property | Config Key | Description |
|---|---|---|
| spark.datahub.s3.bucket | s3.bucket | S3 bucket name |
| spark.datahub.s3.prefix | s3.prefix | Object key prefix |
| spark.datahub.s3.region | s3.region | AWS region |
| spark.datahub.s3.profile | s3.profile | AWS credential profile name |
| spark.datahub.s3.endpoint | s3.endpoint | Custom S3 endpoint URL |
| spark.datahub.s3.access_key | s3.access_key | AWS access key |
| spark.datahub.s3.secret_key | s3.secret_key | AWS secret key |
| spark.datahub.s3.filename | s3.filename | Output file name within S3 |

Theoretical Basis

This principle follows the namespace-scoped configuration pattern, where Spark's existing configuration system is leveraged to carry DataHub-specific settings under the spark.datahub.* prefix. This avoids separate config files and ensures that configuration travels with the application.

The configuration is parsed using Typesafe Config (HOCON format), which provides:

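  • Hierarchical structure: Dotted keys such as rest.server are parsed into a tree, so all rest.* settings can be read as one sub-configuration
  • Type safety: Spark supplies every value as a string; Typesafe Config converts it to the expected type (string, boolean, int) when it is read
  • Defaults: Missing values fall back to documented defaults, for example http://localhost:8080 for rest.server
  • Overriding: The same property can be set in spark-defaults.conf, on the command line, or in code; values set in code take precedence over --conf flags, which take precedence over spark-defaults.conf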
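
The override behaviour can be pictured as a layered lookup. The sketch below uses Python's `collections.ChainMap` to mimic Spark's documented precedence (values set on SparkConf in code, then spark-submit flags, then spark-defaults.conf). The three dictionaries and their values are invented for the example, and Spark itself performs this merge before the DataHub agent ever sees the configuration.

```python
from collections import ChainMap

# Lowest precedence: entries from spark-defaults.conf (sample values).
defaults_file = {
    "spark.datahub.rest.server": "http://localhost:8080",
    "spark.datahub.rest.max_retries": "1",
}

# Middle: --conf flags passed to spark-submit (sample values).
submit_flags = {
    "spark.datahub.rest.server": "https://datahub.example.com:8080",
}

# Highest: properties set programmatically on SparkConf (sample values).
in_code = {
    "spark.datahub.rest.max_retries": "3",
}

# ChainMap searches its maps left to right, so the first map wins.
effective = ChainMap(in_code, submit_flags, defaults_file)

print(effective["spark.datahub.rest.server"])
print(effective["spark.datahub.rest.max_retries"])
```

Here the server URL comes from the command line and the retry count from code, while the values in the defaults file are shadowed.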

The emitter type selection follows the Strategy pattern: the value of spark.datahub.emitter (rest when the key is not set) selects which emitter implementation to instantiate, and only that emitter's parameter group is then read.
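
A minimal sketch of that selection step follows. The factory functions, the `EMITTER_FACTORIES` table, and the strings they return are placeholders invented for the example; they stand in for real emitter objects and do not reflect the agent's class names.

```python
from typing import Callable, Dict, Mapping

Conf = Mapping[str, str]


# Hypothetical factories: each returns a description instead of a real emitter.
def make_rest(conf: Conf) -> str:
    return "rest emitter -> " + conf.get("rest.server", "http://localhost:8080")


def make_kafka(conf: Conf) -> str:
    return "kafka emitter -> " + conf.get("kafka.bootstrap", "<unset>")


def make_file(conf: Conf) -> str:
    return "file emitter -> " + conf.get("file.filename", "<unset>")


def make_s3(conf: Conf) -> str:
    return "s3 emitter -> " + conf.get("s3.bucket", "<unset>")


EMITTER_FACTORIES: Dict[str, Callable[[Conf], str]] = {
    "rest": make_rest,
    "kafka": make_kafka,
    "file": make_file,
    "s3": make_s3,
}


def create_emitter(conf: Conf) -> str:
    """Pick a strategy from the prefix-stripped 'emitter' key (default: rest)."""
    kind = conf.get("emitter", "rest")
    if kind not in EMITTER_FACTORIES:
        raise ValueError(f"unsupported emitter type: {kind!r}")
    return EMITTER_FACTORIES[kind](conf)


print(create_emitter({}))
print(create_emitter({"emitter": "file", "file.filename": "/tmp/datahub_mcps.json"}))
```

Keeping the mapping in one table means a new emitter type needs one new factory and one new entry, while the selection logic stays unchanged.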

Usage

This principle applies when configuring how the Spark lineage agent connects to DataHub for metadata emission.

# REST emitter (default)
spark-submit \
  --conf "spark.datahub.rest.server=https://datahub.example.com:8080" \
  --conf "spark.datahub.rest.token=eyJ..." \
  --conf "spark.datahub.rest.max_retries=3" \
  my_app.py

# Kafka emitter
spark-submit \
  --conf "spark.datahub.emitter=kafka" \
  --conf "spark.datahub.kafka.bootstrap=kafka:9092" \
  --conf "spark.datahub.kafka.schema_registry_url=http://schema-registry:8081" \
  my_app.py

# File emitter
spark-submit \
  --conf "spark.datahub.emitter=file" \
  --conf "spark.datahub.file.filename=/tmp/datahub_mcps.json" \
  my_app.py

# S3 emitter
spark-submit \
  --conf "spark.datahub.emitter=s3" \
  --conf "spark.datahub.s3.bucket=my-datahub-bucket" \
  --conf "spark.datahub.s3.prefix=lineage/" \
  --conf "spark.datahub.s3.region=us-east-1" \
  my_app.py
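
When jobs are launched from a script, the same flags can be generated from a dictionary instead of being typed by hand. The helper below is a hypothetical convenience, not part of DataHub or Spark. It only builds the argument list and assumes the caller passes it to something like `subprocess.run`.

```python
import shlex
from typing import List, Mapping


def spark_submit_command(app: str, datahub_settings: Mapping[str, str]) -> List[str]:
    """Turn prefix-free keys (e.g. 'rest.server') into spark-submit --conf flags."""
    argv = ["spark-submit"]
    for key, value in datahub_settings.items():
        argv += ["--conf", f"spark.datahub.{key}={value}"]
    argv.append(app)
    return argv


cmd = spark_submit_command(
    "my_app.py",
    {"emitter": "file", "file.filename": "/tmp/datahub_mcps.json"},
)

# shlex.join renders the list as a shell-safe command line for logging.
print(shlex.join(cmd))
```

Building a list instead of a single string avoids quoting problems when a value, such as a token, contains characters the shell would otherwise interpret.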

Related

Implementation:Datahub_project_Datahub_SparkConfigParser_ParseSparkConfig
