
Workflow:Apache Spark Kubernetes Deployment

From Leeroopedia


Knowledge Sources

Domains: Kubernetes, Container_Orchestration, Cloud_Native
Last Updated: 2026-02-08 22:00 GMT

Overview

End-to-end process for deploying and running Spark applications on a Kubernetes cluster, from Docker image creation through application submission and lifecycle management.

Description

This workflow covers running Spark applications natively on Kubernetes, using the built-in Kubernetes scheduler. Spark creates driver and executor pods directly within the Kubernetes cluster, leveraging native K8s features for resource management, pod scheduling, and lifecycle handling. The workflow includes building custom Docker images, configuring Kubernetes-specific settings, submitting applications via spark-submit with the k8s:// master URL, and managing pod templates for advanced scheduling requirements.

Usage

Execute this workflow when you need to run Spark workloads on an existing Kubernetes infrastructure. This is suitable for organizations that have standardized on Kubernetes for container orchestration, need dynamic resource allocation across multiple workloads, require integration with cloud-native tooling, or want to leverage Kubernetes features like RBAC, namespaces, and resource quotas for Spark jobs.

Execution Steps

Step 1: Prerequisites Verification

Verify that the Kubernetes cluster meets the requirements for running Spark. This includes checking the cluster version, verifying kubectl access, ensuring proper RBAC permissions (list, create, edit, delete pods), and confirming Kubernetes DNS is configured.

Key considerations:

  • Kubernetes cluster version >= 1.33 required
  • Service account must have permissions to create pods, services, and configmaps
  • Verify access with kubectl auth can-i commands
  • For local testing, minikube with at least 3 CPUs and 4GB memory is recommended
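The prerequisite checks above can be scripted before any submission. The sketch below assumes kubectl is already configured against the target cluster and that jobs will run in a namespace named spark (an illustrative choice, not a Spark default):

```shell
# Confirm API reachability and cluster version
kubectl version
kubectl cluster-info

# Verify the submitting identity holds the pod/service/configmap
# permissions Spark needs in the target namespace
kubectl auth can-i create pods --namespace spark
kubectl auth can-i list pods --namespace spark
kubectl auth can-i delete pods --namespace spark
kubectl auth can-i create services --namespace spark
kubectl auth can-i create configmaps --namespace spark

# For local testing, start minikube with the recommended resources
minikube start --cpus 3 --memory 4096
```

Each `kubectl auth can-i` call prints `yes` or `no`, so the checks are easy to fold into a pre-flight script.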

Step 2: Docker Image Build

Build Docker images containing the Spark runtime using bin/docker-image-tool.sh. The tool builds base JVM images by default, with optional PySpark and SparkR images. Images can be customized with additional dependencies, and the tool supports pushing to container registries.

Key considerations:

  • Default images support JVM-based applications
  • Use -p flag with Python Dockerfile for PySpark images
  • Use -R flag with R Dockerfile for SparkR images
  • Pre-built Apache Spark Docker images are available on Docker Hub
  • Custom USER directives can set the container UID for security
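A typical build sequence with bin/docker-image-tool.sh, run from the root of an unpacked Spark distribution, might look like the following; the registry name and tag are placeholders for your environment:

```shell
# Build the base JVM image
./bin/docker-image-tool.sh -r myregistry.example.com/spark -t v1.0.0 build

# Additionally build the PySpark image from the bundled Dockerfile
./bin/docker-image-tool.sh -r myregistry.example.com/spark -t v1.0.0 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

# Additionally build the SparkR image
./bin/docker-image-tool.sh -r myregistry.example.com/spark -t v1.0.0 \
  -R ./kubernetes/dockerfiles/spark/bindings/R/Dockerfile build

# Push all built images to the registry
./bin/docker-image-tool.sh -r myregistry.example.com/spark -t v1.0.0 push
```

The resulting image reference (registry/repo:tag) is what later feeds spark.kubernetes.container.image.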

Step 3: Kubernetes Configuration

Configure Kubernetes-specific Spark settings including namespace, service account, container image references, resource requests/limits, and optional features like volume mounts and pod templates. Pod templates allow fine-grained control over pod specifications including node selectors, tolerations, and security contexts.

Key considerations:

  • Set spark.kubernetes.container.image to your built image
  • Configure spark.kubernetes.namespace for workload isolation
  • Use pod templates for advanced scheduling (node affinity, tolerations)
  • RBAC configuration may be needed for the driver service account
  • Volume mounts support hostPath, emptyDir, NFS, and PVC types
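As a sketch of the RBAC and pod-template pieces, the commands below create a dedicated service account and a minimal executor template; the namespace, account, role binding, and file names are all illustrative:

```shell
# Create a namespace and a service account for the driver, then
# grant the account the built-in "edit" role in that namespace
kubectl create namespace spark
kubectl create serviceaccount spark --namespace spark
kubectl create clusterrolebinding spark-role \
  --clusterrole=edit \
  --serviceaccount=spark:spark \
  --namespace=spark

# Example executor pod template adding a node selector and a
# toleration; Spark overlays its own settings on top of this spec
cat > executor-template.yaml <<'EOF'
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    workload: spark
  tolerations:
    - key: dedicated
      operator: Equal
      value: spark
      effect: NoSchedule
EOF
```

The template is then referenced at submission time with --conf spark.kubernetes.executor.podTemplateFile=executor-template.yaml (and spark.kubernetes.driver.podTemplateFile for the driver).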

Step 4: Application Submission

Submit the Spark application using spark-submit with the k8s:// master URL prefix. The master URL points to the Kubernetes API server. Spark creates the driver pod, which then creates executor pods. The application JAR or Python file must be accessible from within the container (local:// scheme for bundled files, or remote URLs).

Key considerations:

  • Master URL format: k8s://https://API_SERVER_HOST:PORT
  • Use --deploy-mode cluster for standard K8s deployments
  • Application files with local:// scheme must be pre-bundled in the Docker image
  • Use kubectl proxy for simplified API server access during development
  • Application names must use lowercase alphanumeric, hyphens, and dots only
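Putting these settings together, a cluster-mode submission of the bundled SparkPi example might look like this; the API server address, namespace, service account, image name, and jar path are placeholders to adapt to your environment:

```shell
./bin/spark-submit \
  --master k8s://https://API_SERVER_HOST:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=myregistry.example.com/spark:v1.0.0 \
  --conf spark.executor.instances=3 \
  local:///opt/spark/examples/jars/spark-examples.jar
```

The local:// path must match where the jar actually lives inside the image. During development, running kubectl proxy lets you submit with --master k8s://http://127.0.0.1:8001 instead of addressing the API server directly.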

Step 5: Monitoring and Lifecycle

Monitor running applications through Kubernetes-native tools (kubectl logs, kubectl get pods) and the Spark Web UI. The driver pod persists in a completed state after the application finishes, so its logs remain available for inspection, while executor pods are deleted automatically on completion. Because executor pods carry an OwnerReference pointing at the driver pod, removing the driver pod also garbage-collects its executors, even after a failure.

Key considerations:

  • Executor pods terminate automatically when the application completes
  • Driver pods remain in "completed" state for log access until garbage collected
  • Set spark.kubernetes.driver.pod.name when running driver inside a pod (client mode)
  • Graceful decommissioning is supported for executor pods via decom.sh
  • The Kubernetes entrypoint script handles signal forwarding for clean shutdown
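Routine monitoring of a job in a spark namespace (again a placeholder) can be done with kubectl alone, using the spark-role labels Spark applies to its pods:

```shell
# List driver and executor pods by the labels Spark sets on them
kubectl get pods --namespace spark -l spark-role=driver
kubectl get pods --namespace spark -l spark-role=executor

# Stream driver logs (spark-submit prints the driver pod name)
kubectl logs -f <driver-pod-name> --namespace spark

# Forward the Spark Web UI of a running driver to localhost:4040
kubectl port-forward <driver-pod-name> 4040:4040 --namespace spark

# Delete a completed driver pod once its logs are no longer needed;
# its executors are owned by it and are cleaned up along with it
kubectl delete pod <driver-pod-name> --namespace spark
```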

Execution Diagram

GitHub URL

Workflow Repository