Principle:Spotify Luigi Spark Resource Configuration
Template:Knowledge Sources
Template:Domains
Template:Last Updated
Overview
Spark Resource Configuration is the practice of declaring the computational resources -- memory, CPU cores, and executor count -- allocated to the driver and executor processes of a distributed Spark job.
Description
Every Spark application runs as two kinds of JVM (or Python) processes:
- Driver -- The single process that hosts the user's main function, creates the
SparkContext, and orchestrates the execution plan. - Executors -- The worker processes that run on cluster nodes, execute individual tasks (partitions of a stage), and cache data in memory.
Resource configuration determines how much memory and how many CPU cores each of these processes is allowed to consume, as well as how many executor instances the cluster manager should launch. Incorrect resource settings are one of the most common causes of Spark job failures:
- Too little driver memory leads to
OutOfMemoryErrorwhen the driver collects large results or builds a complex execution plan. - Too little executor memory causes excessive garbage-collection pauses or spill-to-disk, dramatically slowing computation.
- Too few executor cores under-utilises available cluster capacity.
- Too many executors or cores can starve other jobs in a shared cluster, or exceed the cluster manager's resource limits.
Beyond memory and cores, resource configuration also includes:
- Driver Java options -- JVM flags for the driver (e.g., garbage-collector tuning).
- Driver library path and classpath -- Additional native libraries or JARs loaded by the driver.
- Queue -- For YARN clusters, the scheduler queue into which the application is placed.
- Supervise mode -- For standalone clusters, whether the cluster manager should automatically restart the driver on failure.
Usage
Configure Spark resources when:
- You are running Spark jobs on a shared cluster and must fit within resource quotas.
- Your job processes large datasets that demand careful memory tuning for both driver and executors.
- You need to scale executor count or core allocation up or down based on data volume.
- You are targeting a YARN cluster and must route jobs to a specific scheduler queue.
- You want driver-level JVM tuning (heap size, GC algorithm) without modifying Spark's global defaults.
Theoretical Basis
Spark Resource Configuration is governed by the resource negotiation model used by cluster managers:
- Request phase -- The Spark driver sends a resource request to the cluster manager, specifying the number of executors, cores per executor, and memory per executor. For YARN, this request also includes a queue name.
- Allocation phase -- The cluster manager examines available resources across the cluster and grants (or queues) the request. YARN uses capacity scheduler or fair scheduler policies; standalone mode uses first-come-first-served.
- Enforcement phase -- Once allocated, each executor container is hard-limited to its declared memory. If an executor exceeds its memory limit, the cluster manager terminates the container. The driver is similarly constrained.
The key resource parameters map to spark-submit flags as follows:
| Parameter | Flag | Scope |
|---|---|---|
| Driver memory | --driver-memory |
Driver JVM heap size (e.g., 2g, 4096m).
|
| Driver cores | --driver-cores |
Number of cores for the driver (cluster mode only). |
| Executor memory | --executor-memory |
Memory per executor JVM (e.g., 3g).
|
| Executor cores | --executor-cores |
Cores per executor (YARN / standalone). |
| Number of executors | --num-executors |
Total executor instances to launch (YARN). |
| Total executor cores | --total-executor-cores |
Aggregate cores across all executors (standalone / Mesos). |
| Queue | --queue |
YARN scheduler queue name. |
| Supervise | --supervise |
Auto-restart driver on failure (standalone cluster mode). |
Proper resource tuning is iterative: monitor job metrics (shuffle spill, GC time, task duration), adjust parameters, and rerun until the job completes reliably within its time and resource budgets.