Principle:Spotify Luigi Spark Configuration
Overview
Spark Configuration is the practice of declaring the runtime environment, cluster manager, and submission parameters that govern how a distributed Spark job is launched.
Description
When a distributed processing framework like Apache Spark is used to run analytical workloads, each job must be accompanied by a set of configuration values that determine where the job runs, how it connects to the cluster, and which supplementary artifacts (JARs, Python files, property files) are included with the submission. Spark Configuration encompasses the entire surface area of settings passed to the spark-submit launcher, including:
- Cluster manager target -- selecting among local mode, standalone cluster, Apache Mesos, or YARN.
- Deploy mode -- choosing whether the driver runs on the submitting machine (client mode) or inside the cluster (cluster mode).
- Application identity -- assigning a human-readable name and, for JVM applications, specifying the main entry-point class.
- Supplementary files -- attaching additional JARs, Python modules, generic files, and archive bundles that are distributed to worker nodes.
- Arbitrary Spark properties -- passing key-value pairs through the --conf flag or via a dedicated properties file.
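To make the settings surface above concrete, here is a rough sketch of a single spark-submit argument list that touches each category. The flag names are standard spark-submit options; every value (application name, class, paths, property values) is a hypothetical placeholder chosen for illustration.

```python
import shlex

# All values are hypothetical; only the flag names are spark-submit options.
argv = [
    "spark-submit",
    "--master", "yarn",                           # cluster manager target
    "--deploy-mode", "cluster",                   # driver runs inside the cluster
    "--name", "daily-aggregation",                # human-readable application name
    "--class", "com.example.DailyAggregation",    # JVM entry-point class
    "--jars", "libs/parser.jar,libs/udfs.jar",    # supplementary JARs, comma-joined
    "--files", "conf/lookup.csv",                 # generic file shipped with the job
    "--conf", "spark.executor.memory=4g",         # arbitrary Spark property
    "--conf", "spark.sql.shuffle.partitions=200", # --conf is repeated per property
    "app/daily-aggregation.jar",                  # the application artifact itself
    "--date", "2024-01-01",                       # arguments passed to the application
]

# Render the list as a shell-safe string for logging or inspection.
print(shlex.join(argv))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when a value contains spaces.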
In a pipeline orchestration context, externalising these settings from application code is essential. It allows the same analytical logic to be deployed against a local development cluster, a staging environment, and a production YARN cluster without changing the task definition itself. The configuration layer acts as the contract between the orchestrator and the cluster manager.
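A minimal sketch of this idea, using only Python's standard configparser: the same application reference is paired with two different [spark] sections. The key names master and deploy-mode mirror the spark-submit flags and are an assumption here, as are the helper names and the inline config strings (in practice each environment would have its own configuration file).

```python
import configparser

# Hypothetical per-environment settings, inlined as strings for the example.
DEV_CFG = """
[spark]
master = local[*]
deploy-mode = client
"""

PROD_CFG = """
[spark]
master = yarn
deploy-mode = cluster
"""


def load_spark_settings(cfg_text):
    """Parse the [spark] section of a config document into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return dict(parser["spark"])


def describe_job(app, settings):
    """Summarise where an unchanged application would run under given settings."""
    return f"{app} on {settings['master']} ({settings['deploy-mode']} mode)"


APP = "aggregate.py"  # the analytical logic is identical in both environments
print(describe_job(APP, load_spark_settings(DEV_CFG)))
print(describe_job(APP, load_spark_settings(PROD_CFG)))
```

Only the configuration text differs between the two calls; the application reference and the code that consumes the settings stay the same.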
Usage
Apply Spark Configuration when:
- You need to target different cluster managers (local, standalone, Mesos, YARN) across environments.
- Your Spark jobs require additional dependencies (JARs, Python packages, or files) that must be shipped to executors.
- You want to centralise Spark submission settings in a configuration file so that individual task definitions remain environment-agnostic.
- You are submitting Spark applications from a workflow orchestrator and must programmatically construct the spark-submit command line.
Theoretical Basis
Spark Configuration follows the externalised configuration pattern common in twelve-factor application design. The algorithm is straightforward:
- Resolve defaults -- Each configuration property has a default value (often None, meaning the flag is omitted from the command).
- Layer overrides -- A configuration source (file, environment variable, or class attribute) may override any default. In Luigi, this is the [spark] section in luigi.cfg.
- Assemble the command -- At submission time, each non-None property is mapped to its corresponding spark-submit CLI flag. List-valued properties (e.g., multiple JARs) are joined with commas. Dictionary-valued properties (e.g., arbitrary --conf pairs) are expanded into repeated flag-value pairs.
- Delegate to spark-submit -- The fully assembled argument list is handed to the operating system via subprocess.Popen.
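The four steps can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not Luigi's actual code: the DEFAULTS table, the function names, and the convention that each key doubles as its flag name are all hypothetical simplifications.

```python
import subprocess

# Step 1: defaults. None means "omit this flag from the command".
# Keys double as spark-submit flag names (an assumption of this sketch).
DEFAULTS = {
    "master": None,
    "deploy-mode": None,
    "name": None,
    "jars": None,  # list of paths when set
    "conf": None,  # dict of Spark properties when set
}


def resolve(overrides):
    """Step 2: layer overrides on top of the defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def build_command(app, settings, app_args=()):
    """Step 3: map each non-None setting to its spark-submit flag."""
    cmd = ["spark-submit"]
    for key, value in settings.items():
        if value is None:
            continue  # default left untouched: flag is omitted
        flag = f"--{key}"
        if isinstance(value, dict):
            # Dictionaries expand into repeated flag/value pairs.
            for prop, prop_value in value.items():
                cmd += [flag, f"{prop}={prop_value}"]
        elif isinstance(value, (list, tuple)):
            # Lists are joined with commas into a single value.
            cmd += [flag, ",".join(value)]
        else:
            cmd += [flag, str(value)]
    return cmd + [app, *app_args]


def submit(cmd):
    """Step 4: hand the argument list to the OS (needs spark-submit on PATH)."""
    return subprocess.Popen(cmd).wait()


settings = resolve({
    "master": "yarn",
    "jars": ["a.jar", "b.jar"],
    "conf": {"spark.executor.memory": "4g"},
})
cmd = build_command("job.py", settings, ["--date", "2024-01-01"])
print(cmd)
```

Building the command is kept separate from running it, so the assembled argument list can be inspected or tested without a Spark installation; submit(cmd) is only called when the job should actually launch.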
This separation of what to run from where and how to run it is the core principle, enabling reproducible, environment-portable Spark job definitions.