Principle:Apache Spark Standalone Job Submission
| Field | Value |
|---|---|
| Domains | Deployment, Execution |
| Type | Principle |
| Related | Implementation:Apache_Spark_Spark_Submit_Standalone |
Overview
A job submission pattern that runs an application on a Spark standalone cluster by pointing it at the master through a spark://HOST:PORT master URL.
Description
Once a standalone cluster is running, applications are submitted to it, typically with spark-submit, by passing a master URL that uses the spark:// scheme (the master listens on port 7077 by default). The submission process involves four stages:
- Driver connection -- the driver program connects to the master at the spark:// address
- Resource allocation -- the master allocates resources from registered workers based on application requirements
- Executor launch -- workers launch executor JVMs that perform the actual computation
- Task coordination -- the driver distributes tasks to executors and collects results
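The first stage depends on a well-formed master URL. As a minimal sketch, the hypothetical helper below (`parse_master_url` is not part of Spark) splits a spark:// URL into host and port pairs, including the comma-separated form used when standby masters are configured. Host names are placeholders, and Spark performs its own validation, which may differ in detail.

```python
from typing import List, Tuple


def parse_master_url(url: str) -> List[Tuple[str, int]]:
    """Split a spark:// master URL into (host, port) pairs.

    Hypothetical helper for illustration only; it is not a Spark API.
    """
    scheme = "spark://"
    if not url.startswith(scheme):
        raise ValueError(f"not a standalone master URL: {url!r}")

    endpoints = []
    # Several masters may be listed, separated by commas.
    for part in url[len(scheme):].split(","):
        host, sep, port = part.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {part!r}")
        endpoints.append((host, int(port)))
    return endpoints


if __name__ == "__main__":
    # Placeholder host name; 7077 is the master's default port.
    print(parse_master_url("spark://master.example.com:7077"))
```

Checking the URL before submission gives a clearer error than a failed connection attempt from the driver.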
Two deploy modes are supported:
- Client mode -- the driver runs on the submission machine, suitable for interactive use and debugging
- Cluster mode -- the master spawns the driver on a worker node, suitable for production and long-running applications
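The two deploy modes differ only in a few spark-submit flags. The sketch below assembles the argument list for either mode. The flags `--master`, `--deploy-mode`, `--class`, and `--supervise` are real spark-submit options; the function `build_submit_command`, the host name, the class name, and the jar path are assumptions made for illustration.

```python
from typing import List, Optional


def build_submit_command(
    master_url: str,
    app_resource: str,
    deploy_mode: str = "client",
    main_class: Optional[str] = None,
    supervise: bool = False,
) -> List[str]:
    """Build a spark-submit argument list (hypothetical helper, not a Spark API)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode!r}")
    if supervise and deploy_mode != "cluster":
        # --supervise restarts the driver, which only the cluster manages in cluster mode.
        raise ValueError("--supervise only applies to cluster mode")

    cmd = ["spark-submit", "--master", master_url, "--deploy-mode", deploy_mode]
    if main_class:
        cmd += ["--class", main_class]
    if supervise:
        cmd.append("--supervise")
    cmd.append(app_resource)  # the application jar or script goes last
    return cmd


if __name__ == "__main__":
    # Placeholder values; replace with a real master address and application.
    print(" ".join(build_submit_command(
        "spark://master.example.com:7077",
        "/path/to/app.jar",
        deploy_mode="cluster",
        main_class="org.example.MyApp",
        supervise=True,
    )))
```

The resulting list can be passed to `subprocess.run` on a machine where spark-submit is on the PATH.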
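The standalone cluster manager schedules concurrent applications in FIFO order. By default an application tries to use every available core, so sharing the cluster between applications requires capping each one's resources, for example with spark.cores.max. Fair scheduling (spark.scheduler.mode=FAIR) applies to jobs within a single application, not across applications.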
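When several applications share a standalone cluster, each one can be given explicit resource limits at submission time. The sketch below turns a settings dictionary into repeated `--conf` flags for spark-submit. The property names `spark.cores.max`, `spark.executor.cores`, and `spark.executor.memory` are real Spark settings; the helper `conf_args` and the values shown are illustrative assumptions, not recommendations.

```python
from typing import Dict, List


def conf_args(settings: Dict[str, str]) -> List[str]:
    """Render Spark properties as --conf key=value flags (hypothetical helper)."""
    args: List[str] = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Illustrative caps so that one application does not claim the whole cluster.
shared_cluster_settings = {
    "spark.cores.max": "8",         # total cores this application may hold
    "spark.executor.cores": "2",    # cores per executor
    "spark.executor.memory": "4g",  # heap size per executor
}

if __name__ == "__main__":
    print(conf_args(shared_cluster_settings))
```

These flags are placed before the application resource on the spark-submit command line.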
Usage
Use after starting a standalone master and workers to run Spark applications on the cluster. Common scenarios include:
- Interactive analysis -- submitting from a user workstation in client mode
- Production pipelines -- submitting in cluster mode, optionally with --supervise so that the driver is restarted if it exits with a non-zero code
- Batch processing -- submitting multiple applications that share cluster resources
Theoretical Basis
The resource negotiation follows a four-phase protocol:
submit(app, master_url)
-> master.allocate(resources)
-> workers.launch(executors)
-> driver.coordinate(tasks)
| Phase | Actor | Action |
|---|---|---|
| Submit | Client | Sends application to master via spark:// protocol |
| Allocate | Master | Assigns worker resources to the application |
| Launch | Workers | Start executor JVMs for computation |
| Coordinate | Driver | Distributes tasks and collects results |
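The four phases can be traced with a toy model. The sketch below is not Spark's implementation: the classes `Worker` and `Master`, the functions `launch_executors`, `coordinate`, and `submit`, the one-core-at-a-time spreading policy, and the round-robin task assignment are all simplifying assumptions chosen to make the flow of the protocol visible.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Worker:
    name: str
    free_cores: int


@dataclass
class Master:
    workers: List[Worker]

    def allocate(self, cores_requested: int) -> Dict[str, int]:
        """Allocate phase: grant cores one at a time, spreading across workers."""
        grants: Dict[str, int] = {}
        remaining = cores_requested
        while remaining > 0:
            progressed = False
            for worker in self.workers:
                if remaining > 0 and worker.free_cores > 0:
                    worker.free_cores -= 1
                    grants[worker.name] = grants.get(worker.name, 0) + 1
                    remaining -= 1
                    progressed = True
            if not progressed:
                break  # cluster exhausted; the app gets fewer cores than requested
        return grants


def launch_executors(grants: Dict[str, int]) -> List[str]:
    """Launch phase: one executor per worker that received cores."""
    return [f"exec-{name}" for name in grants]


def coordinate(tasks: List[str], executors: List[str]) -> Dict[str, List[str]]:
    """Coordinate phase: the driver hands tasks to executors round-robin."""
    if not executors:
        raise RuntimeError("no executors were launched")
    assignment: Dict[str, List[str]] = {e: [] for e in executors}
    for i, task in enumerate(tasks):
        assignment[executors[i % len(executors)]].append(task)
    return assignment


def submit(master: Master, cores_requested: int, tasks: List[str]) -> Dict[str, List[str]]:
    """Submit phase: run the remaining phases in order."""
    grants = master.allocate(cores_requested)
    executors = launch_executors(grants)
    return coordinate(tasks, executors)


if __name__ == "__main__":
    cluster = Master([Worker("w1", 2), Worker("w2", 4)])
    print(submit(cluster, 5, ["t0", "t1", "t2", "t3", "t4"]))
```

In this model a request for more cores than the cluster has free still succeeds with a smaller grant, which is one plausible way to treat a core cap as an upper bound rather than a hard requirement.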