Principle:Apache Spark Cluster Configuration
Metadata
| Field | Value |
|---|---|
| Domains | Configuration, Deployment |
Overview
A configuration pattern that specifies the target cluster manager and execution mode for distributed application submission through a set of standardized parameters.
Description
Spark applications can run on several cluster managers (Standalone, YARN, Kubernetes, and, in older releases, Mesos, which was deprecated in Spark 3.2). The cluster configuration pattern abstracts deployment details behind two uniform spark-submit options, --master and --deploy-mode:
- The master URL determines which cluster manager handles resource allocation
- The deploy mode determines whether the driver runs on the submission machine (client mode) or inside the cluster (cluster mode)
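As a minimal sketch of how the two options combine, the helper below assembles a spark-submit argument list. The function name build_submit_command and the application file app.py are hypothetical, chosen only for illustration; --master and --deploy-mode are the options described above.

```python
import shlex

VALID_DEPLOY_MODES = ("client", "cluster")


def build_submit_command(master, deploy_mode, app_resource, app_args=()):
    """Assemble a spark-submit argument list (hypothetical helper, not part of Spark)."""
    if deploy_mode not in VALID_DEPLOY_MODES:
        raise ValueError(
            f"deploy mode must be one of {VALID_DEPLOY_MODES}, got {deploy_mode!r}"
        )
    return [
        "spark-submit",
        "--master", master,            # selects the cluster manager
        "--deploy-mode", deploy_mode,  # selects where the driver runs
        app_resource,
        *app_args,
    ]


# Assumed example values: the application file is the same in both submissions.
local_cmd = build_submit_command("local[4]", "client", "app.py")
yarn_cmd = build_submit_command("yarn", "cluster", "app.py")

# shlex.join produces a shell-safe string that could be pasted into a terminal.
print(shlex.join(local_cmd))
print(shlex.join(yarn_cmd))
```

Returning a list rather than a single string keeps each argument separate, so the result can be handed to subprocess.run without shell quoting concerns.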
This abstraction provides several benefits:
- Portability — the same application code runs on any supported cluster manager without modification
- Separation of concerns — application logic is decoupled from deployment topology
- Uniform interface — a single submission command works across all cluster managers
- Flexible scaling — switching from local development to a production cluster requires only changing the master URL
Client Mode vs. Cluster Mode
| Aspect | Client Mode | Cluster Mode |
|---|---|---|
| Driver location | Submission machine | Inside the cluster (for example, a worker node, the YARN application master, or a Kubernetes pod) |
| Console output | Visible locally | Redirected to cluster logs |
| Network dependency | Must stay connected | Can disconnect after submission |
| Use case | Interactive development, debugging | Production jobs, automated pipelines |
Usage
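Apply this pattern whenever you submit a Spark application and need to choose where it runs (the master URL) and where its driver lives (the deploy mode). Client mode suits interactive work, because driver output appears on the local console. Cluster mode suits production jobs, because the driver keeps running even if the submission machine disconnects.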
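The guidance above can be sketched as two small helpers. Both names (default_deploy_mode, check_deploy_mode) are hypothetical and not part of Spark. The one rule encoded beyond the text is that a local master has no cluster to host the driver, so pairing it with cluster mode is treated as an error, which matches how spark-submit behaves.

```python
def default_deploy_mode(interactive):
    """Suggest a deploy mode: client for interactive work, cluster otherwise."""
    return "client" if interactive else "cluster"


def check_deploy_mode(master, deploy_mode):
    """Return deploy_mode if it is usable with master, else raise ValueError."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode!r}")
    # Local mode runs everything in one JVM on the submission machine,
    # so there is no cluster node on which to place the driver.
    if master.startswith("local") and deploy_mode == "cluster":
        raise ValueError("cluster deploy mode is not compatible with a local master")
    return deploy_mode


# A scheduled production job on YARN (assumed scenario).
mode = check_deploy_mode("yarn", default_deploy_mode(interactive=False))
print(mode)
```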
Theoretical Basis
The master URL works like a service locator: its scheme or keyword determines which cluster manager implementation Spark instantiates. Common forms are:
| Master URL | Cluster Manager |
|---|---|
| local[N] | Local mode (N threads) |
| spark://host:port | Standalone cluster manager |
| yarn | Apache YARN |
| k8s://host:port | Kubernetes |
This is analogous to JDBC connection strings, where the URL scheme determines which database driver is loaded. The Spark submission layer parses the master URL and delegates to the matching cluster manager backend.
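The dispatch described here can be illustrated with a small lookup over the forms in the table. This is a simplified sketch and not Spark's actual parsing code: the function name resolve_cluster_manager, the returned labels, and the regular expressions are assumptions for illustration, and the patterns accept the bare local form and local[*] in addition to local[N].

```python
import re

# Ordered (pattern, label) pairs mirroring the master URL table.
_MASTER_PATTERNS = [
    (re.compile(r"^local(\[(\d+|\*)\])?$"), "local"),
    (re.compile(r"^spark://\S+$"), "standalone"),
    (re.compile(r"^yarn$"), "yarn"),
    (re.compile(r"^k8s://\S+$"), "kubernetes"),
]


def resolve_cluster_manager(master):
    """Map a master URL to a cluster manager label (illustrative only)."""
    for pattern, label in _MASTER_PATTERNS:
        if pattern.match(master):
            return label
    raise ValueError(f"unrecognized master URL: {master!r}")


# Hypothetical host names and ports.
for url in ["local[4]", "spark://master-host:7077", "yarn", "k8s://https://api-host:6443"]:
    print(url, "->", resolve_cluster_manager(url))
```

Because the caller only ever passes a string, adding a new backend means adding one entry to the lookup, which is the property the JDBC analogy points to.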