
Principle:Apache Spark K8s Application Submission

From Leeroopedia


Metadata | Value
Domains | Kubernetes, Deployment
Type | Principle
Related | Implementation:Apache_Spark_Spark_Submit_K8s

Overview

A container-orchestrated application submission pattern that creates driver and executor pods in Kubernetes using the k8s:// master URL scheme.

Description

Spark on Kubernetes submission creates the driver as a Kubernetes pod, which then requests executor pods from the Kubernetes API server. The k8s:// master URL scheme tells Spark to use the Kubernetes cluster manager instead of Standalone, YARN, or Mesos.

The submission workflow operates as follows:

  1. The submission client (spark-submit) contacts the Kubernetes API server at the URL specified by --master k8s://https://<host>:<port>.
  2. In cluster mode, the submission client creates a driver pod in the Kubernetes cluster. The driver process runs inside this pod.
  3. The driver pod requests executor pods from the Kubernetes API server based on spark.executor.instances.
  4. Executors register with the driver and begin executing tasks.
  5. When the application completes, executor pods are terminated and cleaned up. The driver pod remains in "completed" state for log inspection until it is garbage collected or manually deleted.
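The workflow above can be sketched as a single spark-submit invocation. This is a minimal cluster-mode example; the API server address, container image, service account, and jar path are placeholders for your environment, not values from this page.

```shell
# Sketch of a cluster-mode submission (assumed endpoint and image names).
spark-submit \
  --master k8s://https://k8s-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=my-registry/spark:3.5.0 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
```

Because the deploy mode is cluster, spark-submit exits after the driver pod is created; spark.executor.instances drives step 3 of the workflow.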

Two deployment modes are supported:

  • Cluster mode (recommended for production) -- The driver runs as a Kubernetes pod. The submission client exits after creating the driver pod.
  • Client mode (useful for debugging) -- The driver runs locally on the submission machine. Only executors are created as Kubernetes pods. The driver must be network-reachable from executor pods.
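The two modes differ only in a few flags. A client-mode sketch, assuming the driver runs in a pod fronted by a hypothetical headless service named spark-driver-svc in the default namespace:

```shell
# Sketch of a client-mode submission from inside the cluster; the
# service name, namespace, and image are assumptions for illustration.
spark-submit \
  --master k8s://https://kubernetes.default.svc:443 \
  --deploy-mode client \
  --name spark-pi-debug \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.driver.host=spark-driver-svc.default.svc.cluster.local \
  --conf spark.driver.port=7078 \
  --conf spark.kubernetes.container.image=my-registry/spark:3.5.0 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
```

Setting spark.driver.host to a stable, resolvable address is what satisfies the requirement that executors can reach the driver.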

The local:// JAR scheme is a Kubernetes-specific feature that references JARs already present inside the container image, avoiding the need to transfer JARs at submission time.

Usage

Use this pattern to submit Spark applications to Kubernetes clusters:

  • Production workloads -- Use cluster mode so the driver lifecycle is managed by Kubernetes.
  • Interactive debugging -- Use client mode to run the driver locally with direct access to driver logs and the Spark UI.
  • Pre-packaged applications -- Use local:// URIs to reference application JARs baked into the container image.
  • Remote dependencies -- Use hdfs://, s3a://, or http:// URIs for JARs hosted externally.
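For the remote-dependency case, the application jar URI simply points at external storage instead of the image. A hedged sketch with an assumed S3 bucket, class name, and credentials provider:

```shell
# Sketch: application jar fetched from S3 at submission time; bucket,
# class, credentials provider, and image are placeholders.
spark-submit \
  --master k8s://https://k8s-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --name spark-etl \
  --class com.example.Etl \
  --conf spark.kubernetes.container.image=my-registry/spark:3.5.0 \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
  s3a://my-bucket/jars/etl-assembly.jar
```

The pods, not the submission machine, download the jar, so the driver and executor pods must have network access and credentials for the remote store.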

Theoretical Basis

The pod-based execution model follows a create-request-register-execute lifecycle:

submit(driver_pod)
  -> driver_pod.request(executor_pods, N)
    -> executors.register(driver)
      -> execute_tasks
        -> cleanup(executor_pods)

Key constraints of this model:

  • The port must always be specified in the master URL, even if it is the standard HTTPS port 443.
  • The application name must be lowercase alphanumeric (with - and . allowed) because it is used to name Kubernetes resources, which have strict naming requirements.
  • In client mode, the driver must be network-routable from executor pods, which may require a headless Kubernetes service.
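The naming constraint above can be checked before submission. A small sketch that validates a candidate application name against the lowercase-alphanumeric rule (with - and . allowed, and an alphanumeric first and last character, per Kubernetes resource naming); the function name is my own:

```shell
# Validate a Spark app name against Kubernetes resource naming rules:
# lowercase alphanumeric plus '-' and '.', starting and ending with
# an alphanumeric character.
is_valid_k8s_name() {
  echo "$1" | grep -Eq '^[a-z0-9]([a-z0-9.-]*[a-z0-9])?$'
}

is_valid_k8s_name "spark-pi" && echo "spark-pi: ok"
is_valid_k8s_name "Spark_Pi" || echo "Spark_Pi: rejected"
```

Names that fail this check (uppercase letters, underscores, a trailing hyphen) cause pod and service creation to be rejected by the API server.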

Deployment Mode Comparison

Aspect | Cluster Mode | Client Mode
Driver location | Kubernetes pod | Local machine or external pod
Executor location | Kubernetes pods | Kubernetes pods
Recommended for | Production | Debugging, interactive use
Driver lifecycle | Managed by Kubernetes | Managed by user
Network requirement | API server reachable from client | Driver reachable from executor pods
Spark UI access | Via port-forward or ingress | Direct on local machine

Related

  • Implementation:Apache_Spark_Spark_Submit_K8s