Workflow: Apache Spark Application Submission
| Knowledge Sources | |
|---|---|
| Domains | Application_Deployment, Cluster_Computing, Data_Engineering |
| Last Updated | 2026-02-08 22:00 GMT |
Overview
End-to-end process for packaging, configuring, and submitting a Spark application to a cluster using spark-submit.
Description
This workflow covers the standard procedure for taking a Spark application (written in Scala, Java, or Python) from development to execution on a cluster. It encompasses dependency bundling, the spark-submit command-line interface, cluster manager selection (standalone, YARN, Kubernetes, or local), deploy mode configuration (client vs cluster), and runtime configuration management. The spark-submit script provides a uniform interface across all supported cluster managers.
Usage
Execute this workflow when you have a self-contained Spark application that needs to be deployed and run on a Spark cluster. This applies to data engineers submitting batch jobs, data scientists running analytics pipelines, and operations teams deploying production workloads.
Execution Steps
Step 1: Application Development
Write the Spark application using the SparkSession API. The application initializes a SparkSession, performs data transformations and actions, and cleanly stops the session on completion. The application can be written in Scala, Java, or Python.
Key considerations:
- Use SparkSession.builder to create the entry point
- Define the application name via appName()
- For Scala/Java, define a main() method (do not extend scala.App)
- For Python, the .py file serves directly as the application entry point
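The structure described above can be sketched as a minimal PySpark application. This is an illustrative skeleton, not a prescribed template: the application name "MyApp" and the input-path argument are placeholders, and pyspark is assumed to be provided by the cluster at runtime.

```python
# Minimal PySpark application skeleton (illustrative; "MyApp" and the
# input-path argument are placeholders, and pyspark is assumed to be
# available on the cluster at runtime).
import sys

def main(argv):
    # Imported inside main so the module can be inspected on machines
    # where pyspark is not installed.
    from pyspark.sql import SparkSession

    # SparkSession.builder is the entry point; appName() names the application.
    spark = SparkSession.builder.appName("MyApp").getOrCreate()
    try:
        # Transformations and actions go here.
        df = spark.read.text(argv[0])
        print(df.count())
    finally:
        # Cleanly stop the session on completion.
        spark.stop()

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```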
Step 2: Dependency Bundling
Package the application code and its dependencies into a deployable artifact. For JVM languages, create an assembly (uber) JAR containing application code and third-party libraries, marking Spark and Hadoop as "provided" dependencies since they are supplied by the cluster. For Python, package dependencies into .zip or .egg files.
Key considerations:
- Use sbt-assembly or Maven Shade plugin for JVM uber JARs
- Mark Spark and Hadoop dependencies as "provided" scope
- For Python, use --py-files to distribute .zip, .egg, or .py files
- The application JAR must be at a URL that is globally visible inside the cluster, e.g. an hdfs:// path or a file:// path present on all nodes
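For the Python side of this step, a .zip bundle for --py-files can be produced with the standard library alone. A minimal sketch; the helper name bundle_py_files and the file paths are illustrative, not part of any Spark API:

```python
# Sketch of bundling Python modules into a .zip suitable for
# spark-submit --py-files. Helper name and paths are illustrative.
import pathlib
import zipfile

def bundle_py_files(sources, out_path):
    """Write the given .py modules into a zip archive at out_path."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for src in map(pathlib.Path, sources):
            # Store each module at the archive root so it is importable.
            zf.write(src, arcname=src.name)
    return out_path
```

The resulting archive is then passed on the command line (e.g. --py-files deps.zip), and Spark places it on the Python path of the driver and executors.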
Step 3: Cluster Configuration
Select the target cluster manager and deploy mode. The master URL determines which cluster manager Spark connects to. The deploy mode controls whether the driver runs on a worker node (cluster mode) or on the submitting machine (client mode).
Key considerations:
- Master URL formats: local[N], spark://host:port, yarn, k8s://host:port
- Client mode runs the driver on the submitting machine and attaches its stdin/stdout to the console (suitable for interactive use)
- Cluster mode runs the driver on the cluster itself, minimizing network latency between driver and executors when submitting from a machine far from the workers
- Configuration can be loaded from conf/spark-defaults.conf or via --properties-file
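Cluster and resource defaults can be kept out of the command line by placing them in conf/spark-defaults.conf. A sketch of such a file, with illustrative values (the property names are standard Spark configuration keys):

```
# conf/spark-defaults.conf -- whitespace-separated key/value pairs.
spark.master              yarn
spark.submit.deployMode   cluster
spark.executor.memory     4g
spark.eventLog.enabled    true
```

Flags passed explicitly to spark-submit take precedence over values in this file.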
Step 4: Application Submission
Submit the application using the bin/spark-submit script. Provide the application JAR or Python file, master URL, deploy mode, resource configuration (executor memory, cores, number of executors), and any application arguments. The script handles classpath setup and configuration propagation.
Key considerations:
- --class specifies the main class for JVM applications
- --executor-memory, --executor-cores, and --num-executors control executor resource allocation
- --conf passes arbitrary Spark configuration properties
- Additional JARs can be included via --jars (comma-separated URLs)
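Put together, a submission for a JVM application might look like the following. This is a hedged sketch: the main class, resource sizes, extra JAR, and application JAR path are all placeholders.

```shell
./bin/spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4G \
  --num-executors 10 \
  --conf spark.eventLog.enabled=true \
  --jars extra-lib.jar \
  /path/to/app-assembly.jar arg1 arg2
```

Everything after the application JAR (here, arg1 arg2) is passed through to the application's main method.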
Step 5: Monitoring and Completion
Monitor the running application through the Spark Web UI and application logs. In client mode, stdout/stderr are visible in the submitting terminal. In cluster mode, logs are available through the cluster manager's UI. The application terminates when the main method completes or an unhandled exception occurs.
Key considerations:
- Spark Web UI runs on port 4040 by default for the driver
- Use --verbose flag on spark-submit for debugging configuration issues
- In standalone cluster mode, --supervise enables automatic driver restart on failure
- YARN and Kubernetes provide their own log aggregation mechanisms
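Beyond the Web UI, the driver also serves a REST monitoring API on the same port, which can be polled programmatically. A sketch using only the standard library; the host is a placeholder, and fetch_applications assumes a live driver to connect to:

```python
# Sketch of polling the driver's REST monitoring API, served on the
# same port as the Web UI (4040 by default). Host/port are placeholders.
import json
import urllib.request

def driver_api_url(host, port=4040):
    """Base URL of the driver's REST monitoring API."""
    return f"http://{host}:{port}/api/v1/applications"

def fetch_applications(host, port=4040, timeout=5):
    """Return the applications the driver reports; requires a running driver."""
    with urllib.request.urlopen(driver_api_url(host, port), timeout=timeout) as resp:
        return json.load(resp)
```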