Principle:Spotify Luigi Spark Configuration

Overview

Spark Configuration is the practice of declaring the runtime environment, cluster manager, and submission parameters that govern how a distributed Spark job is launched.

Description

When a distributed processing framework like Apache Spark runs analytical workloads, each job must be accompanied by configuration values that determine where the job runs, how it connects to the cluster, and which supplementary artifacts (JARs, Python files, property files) are included with the submission. Spark Configuration covers every setting passed to the spark-submit launcher, including:

Cluster manager target -- selecting among local mode, standalone cluster, Apache Mesos, or YARN.
Deploy mode -- choosing whether the driver runs on the submitting machine (client mode) or inside the cluster (cluster mode).
Application identity -- assigning a human-readable name and, for JVM applications, specifying the main entry-point class.
Supplementary files -- attaching additional JARs, Python modules, generic files, and archive bundles that are distributed to worker nodes.
Arbitrary Spark properties -- passing key-value pairs through the --conf flag or via a dedicated properties file.
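
To make the five categories concrete, the following sketch shows one possible spark-submit argument list, annotated by category. The flags are standard spark-submit options; every path, application name, class name, and property value is hypothetical and chosen only for illustration.

```python
# Illustrative argument list only: all paths, names, and values are hypothetical.
argv = [
    "spark-submit",
    # Cluster manager target
    "--master", "yarn",
    # Deploy mode
    "--deploy-mode", "cluster",
    # Application identity
    "--name", "daily-aggregation",
    "--class", "com.example.DailyAggregation",
    # Supplementary files (multiple values are comma-separated)
    "--jars", "libs/udfs.jar,libs/connectors.jar",
    "--files", "conf/lookup.csv",
    # Arbitrary Spark properties (one --conf per key=value pair)
    "--conf", "spark.executor.memory=4g",
    "--conf", "spark.sql.shuffle.partitions=200",
    "--properties-file", "conf/spark-prod.conf",
    # The application artifact itself comes after all options
    "app/daily-aggregation.jar",
]

print(" ".join(argv))
```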

In a pipeline orchestration context, externalising these settings from application code is essential. It allows the same analytical logic to be deployed against a local development cluster, a staging environment, and a production YARN cluster without changing the task definition itself. The configuration layer acts as the contract between the orchestrator and the cluster manager.
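
As a minimal sketch of this idea, the example below reads a `[spark]` section from two hypothetical configuration texts (one for development, one for production) and feeds them to the same job description function. It uses only Python's standard `configparser`; the helper names `load_spark_settings` and `describe_job` are invented for illustration, and the key names simply mirror spark-submit flag names as an assumption about what such a file might contain.

```python
import configparser

# Hypothetical per-environment configuration, inlined as strings for brevity.
DEV_CFG = """
[spark]
master = local[*]
"""

PROD_CFG = """
[spark]
master = yarn
deploy-mode = cluster
"""


def load_spark_settings(cfg_text):
    """Return the [spark] section as a plain dict (empty if the section is absent)."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return dict(parser["spark"]) if parser.has_section("spark") else {}


def describe_job(app, settings):
    """The 'task definition': identical for every environment."""
    master = settings.get("master", "local[*]")
    mode = settings.get("deploy-mode", "client")
    return f"{app} on {master} ({mode} mode)"


print(describe_job("job.py", load_spark_settings(DEV_CFG)))
# job.py on local[*] (client mode)
print(describe_job("job.py", load_spark_settings(PROD_CFG)))
# job.py on yarn (cluster mode)
```

Only the configuration text differs between the two calls; the function that stands in for the task definition is untouched.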

Usage

Apply Spark Configuration when:

You need to target different cluster managers (local, standalone, Mesos, YARN) across environments.
Your Spark jobs require additional dependencies (JARs, Python packages, or files) that must be shipped to executors.
You want to centralise Spark submission settings in a configuration file so that individual task definitions remain environment-agnostic.
You are submitting Spark applications from a workflow orchestrator and must programmatically construct the spark-submit command line.

Theoretical Basis

Spark Configuration follows the externalised configuration pattern common in twelve-factor application design. The command is built in four steps:

1. Resolve defaults -- Each configuration property has a default value (often None, meaning the flag is omitted from the command).
2. Layer overrides -- A configuration source (file, environment variable, or class attribute) may override any default. In Luigi, the file source is the [spark] section of luigi.cfg.
3. Assemble the command -- At submission time, each non-None property is mapped to its corresponding spark-submit CLI flag. List-valued properties (e.g., multiple JARs) are joined with commas into a single flag value. Dictionary-valued properties (e.g., arbitrary --conf pairs) are expanded into one flag-value pair per entry.
4. Delegate to spark-submit -- The fully assembled argument list is launched as a child process via subprocess.Popen.
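
The steps above can be sketched in a few lines of standard-library Python. This is an illustration of the pattern under simplifying assumptions, not Luigi's actual implementation: the names `DEFAULTS`, `FLAGS`, `resolve`, `assemble`, and `submit` are hypothetical, only a handful of settings are modelled, and empty lists or dictionaries are treated the same as None.

```python
import subprocess

# Step 1: defaults. None means "omit this flag". Setting names are hypothetical.
DEFAULTS = {
    "master": None,
    "deploy_mode": None,
    "name": None,
    "jars": None,  # list of JAR paths
    "conf": None,  # dict of Spark properties
}

# Maps each setting name to its spark-submit flag.
FLAGS = {
    "master": "--master",
    "deploy_mode": "--deploy-mode",
    "name": "--name",
    "jars": "--jars",
    "conf": "--conf",
}


def resolve(overrides):
    """Step 2: layer overrides on top of the defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def assemble(settings, app, app_args=()):
    """Step 3: map every set property to its CLI flag."""
    command = ["spark-submit"]
    for key, flag in FLAGS.items():
        value = settings[key]
        if not value:
            continue  # unset (None) or empty: leave the flag out
        if isinstance(value, dict):
            # One "--conf key=value" pair per dictionary entry.
            for prop, prop_value in value.items():
                command += [flag, f"{prop}={prop_value}"]
        elif isinstance(value, (list, tuple)):
            # List values collapse into one comma-separated argument.
            command += [flag, ",".join(value)]
        else:
            command += [flag, str(value)]
    return command + [app, *app_args]


def submit(command):
    """Step 4: hand the argument list to the OS and return the exit code."""
    return subprocess.Popen(command).wait()


settings = resolve({
    "master": "yarn",
    "jars": ["a.jar", "b.jar"],
    "conf": {"spark.executor.memory": "4g"},
})
command = assemble(settings, "job.py", ["--date", "2024-01-01"])
print(command)
```

Passing the command as a list rather than a shell string avoids quoting problems when property values contain spaces. `submit(command)` is deliberately not called here, since it requires a spark-submit executable on the PATH.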

This separation of what to run from where and how to run it is the core principle, enabling reproducible, environment-portable Spark job definitions.

Related Pages

Implementation:Spotify_Luigi_SparkSubmitTask_Config
