
Workflow:Apache Spark Kubernetes Deployment

From Leeroopedia


Knowledge Sources

Domains: Kubernetes, Container_Orchestration, Cloud_Native
Last Updated: 2026-02-08 22:00 GMT

Overview

End-to-end process for deploying and running Spark applications on a Kubernetes cluster, from Docker image creation through application submission and lifecycle management.

Description

This workflow covers running Spark applications natively on Kubernetes, using the built-in Kubernetes scheduler. Spark creates driver and executor pods directly within the Kubernetes cluster, leveraging native K8s features for resource management, pod scheduling, and lifecycle handling. The workflow includes building custom Docker images, configuring Kubernetes-specific settings, submitting applications via spark-submit with the k8s:// master URL, and managing pod templates for advanced scheduling requirements.

Usage

Execute this workflow when you need to run Spark workloads on an existing Kubernetes infrastructure. This is suitable for organizations that have standardized on Kubernetes for container orchestration, need dynamic resource allocation across multiple workloads, require integration with cloud-native tooling, or want to leverage Kubernetes features like RBAC, namespaces, and resource quotas for Spark jobs.

Execution Steps

Step 1: Prerequisites Verification

Verify that the Kubernetes cluster meets the requirements for running Spark. This includes checking the cluster version, verifying kubectl access, ensuring proper RBAC permissions (list, create, edit, delete pods), and confirming Kubernetes DNS is configured.

Key considerations:

  • Kubernetes cluster version >= 1.33 required
  • Service account must have permissions to create pods, services, and configmaps
  • Verify access with kubectl auth can-i commands
  • For local testing, minikube with at least 3 CPUs and 4GB memory is recommended
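The prerequisite checks above can be scripted before any submission. The sketch below assumes kubectl is already configured against the target cluster and that jobs will run in a namespace named spark (an illustrative choice, not a Spark default):

```shell
# Confirm API reachability and cluster version
kubectl version
kubectl cluster-info

# Verify the submitting identity holds the pod/service/configmap
# permissions Spark needs in the target namespace
kubectl auth can-i create pods --namespace spark
kubectl auth can-i list pods --namespace spark
kubectl auth can-i delete pods --namespace spark
kubectl auth can-i create services --namespace spark
kubectl auth can-i create configmaps --namespace spark

# For local testing, start minikube with the recommended resources
minikube start --cpus 3 --memory 4096
```

Each `kubectl auth can-i` call prints `yes` or `no`, so the checks are easy to fold into a pre-flight script.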

Step 2: Docker Image Build

Build Docker images containing the Spark runtime using bin/docker-image-tool.sh. The tool builds base JVM images by default, with optional PySpark and SparkR images. Images can be customized with additional dependencies, and the tool supports pushing to container registries.

Key considerations:

  • Default images support JVM-based applications
  • Use -p flag with Python Dockerfile for PySpark images
  • Use -R flag with R Dockerfile for SparkR images
  • Pre-built Apache Spark Docker images are available on Docker Hub
  • Custom USER directives can set the container UID for security
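A typical build sequence with bin/docker-image-tool.sh, run from the root of an unpacked Spark distribution, might look like the following; the registry name and tag are placeholders for your environment:

```shell
# Build the base JVM image
./bin/docker-image-tool.sh -r myregistry.example.com/spark -t v1.0.0 build

# Additionally build the PySpark image from the bundled Dockerfile
./bin/docker-image-tool.sh -r myregistry.example.com/spark -t v1.0.0 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

# Additionally build the SparkR image
./bin/docker-image-tool.sh -r myregistry.example.com/spark -t v1.0.0 \
  -R ./kubernetes/dockerfiles/spark/bindings/R/Dockerfile build

# Push all built images to the registry
./bin/docker-image-tool.sh -r myregistry.example.com/spark -t v1.0.0 push
```

The resulting image reference (registry/repo:tag) is what later feeds spark.kubernetes.container.image.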

Step 3: Kubernetes Configuration

Configure Kubernetes-specific Spark settings including namespace, service account, container image references, resource requests/limits, and optional features like volume mounts and pod templates. Pod templates allow fine-grained control over pod specifications including node selectors, tolerations, and security contexts.

Key considerations:

  • Set spark.kubernetes.container.image to your built image
  • Configure spark.kubernetes.namespace for workload isolation
  • Use pod templates for advanced scheduling (node affinity, tolerations)
  • RBAC configuration may be needed for the driver service account
  • Volume mounts support hostPath, emptyDir, NFS, and PVC types
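As a sketch of the RBAC and pod-template pieces, the commands below create a dedicated service account and a minimal executor template; the namespace, account, role binding, and file names are all illustrative:

```shell
# Create a namespace and a service account for the driver, then
# grant the account the built-in "edit" role in that namespace
kubectl create namespace spark
kubectl create serviceaccount spark --namespace spark
kubectl create clusterrolebinding spark-role \
  --clusterrole=edit \
  --serviceaccount=spark:spark \
  --namespace=spark

# Example executor pod template adding a node selector and a
# toleration; Spark overlays its own settings on top of this spec
cat > executor-template.yaml <<'EOF'
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    workload: spark
  tolerations:
    - key: dedicated
      operator: Equal
      value: spark
      effect: NoSchedule
EOF
```

The template is then referenced at submission time with --conf spark.kubernetes.executor.podTemplateFile=executor-template.yaml (and spark.kubernetes.driver.podTemplateFile for the driver).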

Step 4: Application Submission

Submit the Spark application using spark-submit with the k8s:// master URL prefix. The master URL points to the Kubernetes API server. Spark creates the driver pod, which then creates executor pods. The application JAR or Python file must be accessible from within the container (local:// scheme for bundled files, or remote URLs).

Key considerations:

  • Master URL format: k8s://https://API_SERVER_HOST:PORT
  • Use --deploy-mode cluster for standard K8s deployments
  • Application files with local:// scheme must be pre-bundled in the Docker image
  • Use kubectl proxy for simplified API server access during development
  • Application names must use lowercase alphanumeric, hyphens, and dots only
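Putting these settings together, a cluster-mode submission of the bundled SparkPi example might look like this; the API server address, namespace, service account, image name, and jar path are placeholders to adapt to your environment:

```shell
./bin/spark-submit \
  --master k8s://https://API_SERVER_HOST:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=myregistry.example.com/spark:v1.0.0 \
  --conf spark.executor.instances=3 \
  local:///opt/spark/examples/jars/spark-examples.jar
```

The local:// path must match where the jar actually lives inside the image. During development, running kubectl proxy lets you submit with --master k8s://http://127.0.0.1:8001 instead of addressing the API server directly.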

Step 5: Monitoring and Lifecycle

Monitor running applications through Kubernetes-native tools (kubectl logs, kubectl get pods) and the Spark Web UI. The driver pod persists in a completed state after the application finishes, so its logs remain available for inspection, while executor pods are deleted automatically on completion. Because executor pods carry an OwnerReference pointing at the driver pod, removing the driver pod also garbage-collects its executors, even after a failure.

Key considerations:

  • Executor pods terminate automatically when the application completes
  • Driver pods remain in "completed" state for log access until garbage collected
  • Set spark.kubernetes.driver.pod.name when running driver inside a pod (client mode)
  • Graceful decommissioning is supported for executor pods via decom.sh
  • The Kubernetes entrypoint script handles signal forwarding for clean shutdown
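Routine monitoring of a job in a spark namespace (again a placeholder) can be done with kubectl alone, using the spark-role labels Spark applies to its pods:

```shell
# List driver and executor pods by the labels Spark sets on them
kubectl get pods --namespace spark -l spark-role=driver
kubectl get pods --namespace spark -l spark-role=executor

# Stream driver logs (spark-submit prints the driver pod name)
kubectl logs -f <driver-pod-name> --namespace spark

# Forward the Spark Web UI of a running driver to localhost:4040
kubectl port-forward <driver-pod-name> 4040:4040 --namespace spark

# Delete a completed driver pod once its logs are no longer needed;
# its executors are owned by it and are cleaned up along with it
kubectl delete pod <driver-pod-name> --namespace spark
```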

Execution Diagram

GitHub URL

Workflow Repository