
Workflow:Apache Spark Application Submission

From Leeroopedia


Knowledge Sources
Domains Application_Deployment, Cluster_Computing, Data_Engineering
Last Updated 2026-02-08 22:00 GMT

Overview

End-to-end process for packaging, configuring, and submitting a Spark application to a cluster using spark-submit.

Description

This workflow covers the standard procedure for taking a Spark application (written in Scala, Java, or Python) from development to execution on a cluster. It encompasses dependency bundling, the spark-submit command-line interface, cluster manager selection (standalone, YARN, Kubernetes, or local), deploy mode configuration (client vs cluster), and runtime configuration management. The spark-submit script provides a uniform interface across all supported cluster managers.
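As a sketch, the general shape of a spark-submit invocation is shown below; the angle-bracket placeholders stand for application-specific values:

```shell
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  <application-jar> \
  [application-arguments]
```

The same command shape works whether the master URL points at a standalone master, YARN, Kubernetes, or a local run.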

Usage

Execute this workflow when you have a self-contained Spark application that needs to be deployed and run on a Spark cluster. This applies to data engineers submitting batch jobs, data scientists running analytics pipelines, and operations teams deploying production workloads.

Execution Steps

Step 1: Application Development

Write the Spark application using the SparkSession API. The application initializes a SparkSession, performs data transformations and actions, and cleanly stops the session on completion. The application can be written in Scala, Java, or Python.

Key considerations:

  • Use SparkSession.builder to create the entry point
  • Define the application name via appName()
  • For Scala/Java, define a main() method rather than extending scala.App, which may not work correctly
  • For Python, the .py file serves directly as the application entry point
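A minimal Python application following these considerations might look like the following sketch; the file name, application name, and sample data are illustrative:

```python
# my_app.py -- minimal Spark application entry point (illustrative).
from pyspark.sql import SparkSession


def main():
    # Create the entry point via SparkSession.builder and set the app name.
    spark = SparkSession.builder.appName("MyApp").getOrCreate()

    # Example transformation and action: build a small DataFrame and count it.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    print(df.count())

    # Cleanly stop the session on completion.
    spark.stop()


if __name__ == "__main__":
    main()
```

The .py file is then passed directly to spark-submit as the application entry point.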

Step 2: Dependency Bundling

Package the application code and its dependencies into a deployable artifact. For JVM languages, create an assembly (uber) JAR containing application code and third-party libraries, marking Spark and Hadoop as "provided" dependencies since they are supplied by the cluster. For Python, package dependencies into .zip or .egg files.

Key considerations:

  • Use sbt-assembly or Maven Shade plugin for JVM uber JARs
  • Mark Spark and Hadoop dependencies as "provided" scope
  • For Python, use --py-files to distribute .zip, .egg, or .py files
  • The application JAR URL must be globally visible across the cluster (e.g., an hdfs:// path, or a file:// path present on every node)
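For pure-Python dependencies, a zip suitable for --py-files can be built with the standard library alone; a sketch, where the package directory name is hypothetical:

```python
# Bundle a pure-Python package into a zip for spark-submit --py-files.
import pathlib
import zipfile


def bundle(package_dir: str, out_zip: str) -> str:
    root = pathlib.Path(package_dir)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in root.rglob("*.py"):
            # Store paths relative to the package's parent directory so
            # that `import <package>` resolves on the executors.
            zf.write(path, path.relative_to(root.parent))
    return out_zip

# Usage (illustrative): bundle("mydeps", "deps.zip")
# then: spark-submit --py-files deps.zip my_app.py
```

For JVM projects the equivalent step is running the build tool's assembly task (e.g., `sbt assembly`) to produce the uber JAR.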

Step 3: Cluster Configuration

Select the target cluster manager and deploy mode. The master URL determines which cluster manager Spark connects to. The deploy mode controls whether the driver runs on a worker node (cluster mode) or on the submitting machine (client mode).

Key considerations:

  • Master URL formats: local[N], spark://host:port, yarn, k8s://host:port
  • Client mode attaches driver I/O to the console (suitable for interactive use)
  • Cluster mode minimizes network latency between driver and executors
  • Configuration can be loaded from conf/spark-defaults.conf or via --properties-file
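Defaults can be kept in conf/spark-defaults.conf (or a file passed via --properties-file), which holds whitespace-separated key/value pairs; a sketch with illustrative values:

```properties
# conf/spark-defaults.conf (values illustrative)
spark.master                spark://master-host:7077
spark.submit.deployMode     cluster
spark.executor.memory       4g
spark.serializer            org.apache.spark.serializer.KryoSerializer
```

Flags passed explicitly to spark-submit take precedence over values loaded from the properties file.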

Step 4: Application Submission

Submit the application using the bin/spark-submit script. Provide the application JAR or Python file, master URL, deploy mode, resource configuration (executor memory, cores, number of executors), and any application arguments. The script handles classpath setup and configuration propagation.

Key considerations:

  • --class specifies the main class for JVM applications
  • --executor-memory and --num-executors control resource allocation
  • --conf passes arbitrary Spark configuration properties
  • Additional JARs can be included via --jars (comma-separated URLs)
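Putting the flags above together, a submission to a YARN cluster might look like the following; the JAR path, class name, and resource sizes are illustrative:

```shell
# Submit an assembly JAR to YARN in cluster mode (values illustrative).
./bin/spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4g \
  --num-executors 10 \
  --conf spark.sql.shuffle.partitions=200 \
  --jars hdfs:///libs/extra-lib.jar \
  my-app-assembly.jar \
  arg1 arg2
```

For a Python application, the .py file (plus any --py-files archives) takes the place of the JAR, and --class is omitted.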

Step 5: Monitoring and Completion

Monitor the running application through the Spark Web UI and application logs. In client mode, stdout/stderr are visible in the submitting terminal. In cluster mode, logs are available through the cluster manager's UI. The application terminates when the main method completes or an unhandled exception occurs.

Key considerations:

  • Spark Web UI runs on port 4040 by default for the driver
  • Use --verbose flag on spark-submit for debugging configuration issues
  • In standalone cluster mode, --supervise enables automatic driver restart on failure
  • YARN and Kubernetes provide their own log aggregation mechanisms
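Assuming a YARN deployment with log aggregation enabled, the monitoring steps above can be sketched as follows; the class name, JAR, and application ID are placeholders:

```shell
# Print the resolved configuration while submitting, to debug config issues.
./bin/spark-submit --verbose \
  --class com.example.MyApp --master yarn my-app-assembly.jar

# After the application finishes, fetch aggregated driver/executor logs
# through YARN's own log aggregation.
yarn logs -applicationId application_1700000000000_0001
```

While the driver is running, its Web UI remains reachable at port 4040 on the driver host.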

Execution Diagram

GitHub URL

Workflow Repository