Principle:Apache Spark Container Image Build
| Metadata | Value |
|---|---|
| Domains | Kubernetes, Containerization |
| Type | Principle |
| Related | Implementation:Apache_Spark_Docker_Image_Tool |
Overview
A container image build process that packages a Spark distribution with language runtimes into Docker images suitable for Kubernetes pod execution.
Description
Running Spark on Kubernetes requires container images containing the Spark distribution, Java runtime, and optionally Python or R runtimes. The image build process follows a layered construction approach that creates a base Spark image and then derives language-specific images from it.
The build process involves the following stages:
- Base image construction -- The base Spark image includes the Spark JARs, shell scripts (`bin/`, `sbin/`), example JARs, the Kubernetes-specific entrypoint script, and the `tini` init process. The base Dockerfile is located at `kubernetes/dockerfiles/spark/Dockerfile`.
- Language binding derivation -- PySpark and SparkR images are built as derived images that use the base Spark image as their parent (`--build-arg base_img=spark:<tag>`). The PySpark image adds the Python runtime and PySpark libraries; the SparkR image adds R and SparkR packages.
- Cross-platform builds -- For heterogeneous clusters containing both amd64 and arm64 nodes, cross-platform builds using `docker buildx` produce multi-architecture manifests that allow Kubernetes to pull the correct image for each node's architecture.
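The stages above map onto Spark's bundled `docker-image-tool.sh`. A sketch of a typical build-and-push sequence, assuming it is run from an unpacked Spark distribution (the registry name and tag are placeholders):

```shell
# Build the base Spark image plus the PySpark and SparkR derivatives.
# -r sets the repository prefix, -t the image tag; -p and -R select the
# binding Dockerfiles that derive from the base image.
./bin/docker-image-tool.sh -r registry.example.com/myorg -t v3.5.1 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
  -R ./kubernetes/dockerfiles/spark/bindings/R/Dockerfile \
  build

# Push all built images (spark, spark-py, spark-r) to the registry.
./bin/docker-image-tool.sh -r registry.example.com/myorg -t v3.5.1 push
```

Omitting `-p` or `-R` skips the corresponding derived image, so clusters that only run JVM workloads can build just the base image.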
The base image sets a default UID of 185 for the Spark process user, following the principle of least privilege: containers run as a non-root user by default. The UID can be overridden at build time for deployments with stricter security requirements.
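The default UID can be replaced when invoking the image build; `docker-image-tool.sh` exposes a `-u` option for the `USER` directive (the UID value and registry name here are illustrative):

```shell
# Build images whose Spark process runs as UID 1000 instead of the default 185.
./bin/docker-image-tool.sh -r registry.example.com/myorg -t v3.5.1 -u 1000 build
```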
Usage
Use this pattern when preparing to deploy Spark on Kubernetes:
- Initial setup -- Build images once and push to a container registry accessible by the Kubernetes cluster.
- Version upgrades -- Rebuild images when upgrading Spark versions or changing language runtime versions.
- Custom dependencies -- Extend the base Dockerfile to include additional JARs or libraries required by your Spark applications.
- Development workflows -- Use Minikube's Docker daemon directly to avoid pushing images to a remote registry during development.
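For the custom-dependencies case, one common approach is a small derived Dockerfile that layers extra JARs onto the already-built base image (the tag and JAR name below are placeholders):

```dockerfile
# Hypothetical derived image adding an application dependency JAR.
FROM spark:v3.5.1
# JARs placed in /opt/spark/jars are on the driver and executor classpath.
COPY my-connector.jar /opt/spark/jars/
```

For the development workflow, `docker-image-tool.sh` also offers an `-m` flag that builds directly against Minikube's Docker daemon, so no registry push is needed.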
Theoretical Basis
The layered image construction follows a derivation pattern:
base_image(jars, scripts, entrypoint)
-> derive(python_image)
-> derive(r_image)
Cross-platform builds extend this with architecture multiplexing:
buildx(platforms=[amd64, arm64])
-> build_per_platform(base_image)
-> create_manifest_list
-> push
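The multiplexing above corresponds roughly to a `docker buildx` invocation along these lines (registry and tag are placeholders; recent `docker-image-tool.sh` versions wrap this behind an `-X` option):

```shell
# One-time: create and select a buildx builder that can target both architectures.
docker buildx create --name spark-builder --use

# Build for amd64 and arm64, assemble a manifest list, and push in one step.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t registry.example.com/myorg/spark:v3.5.1 \
  -f kubernetes/dockerfiles/spark/Dockerfile \
  --push .
```

Multi-architecture builds must be pushed rather than loaded into the local daemon, because a manifest list referencing several per-platform images lives in a registry.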
For development (non-release) builds, a temporary build context is created to avoid uploading the entire source tree to the Docker daemon, which can be very large due to test logs and build artifacts.
Image Architecture
| Image | Base | Added Content | Dockerfile |
|---|---|---|---|
| `spark:<tag>` | JDK base | Spark JARs, `bin/`, `sbin/`, `entrypoint.sh` | `kubernetes/dockerfiles/spark/Dockerfile` |
| `spark-py:<tag>` | `spark:<tag>` | Python runtime, PySpark libraries | `kubernetes/dockerfiles/spark/bindings/python/Dockerfile` |
| `spark-r:<tag>` | `spark:<tag>` | R runtime, SparkR packages | `kubernetes/dockerfiles/spark/bindings/R/Dockerfile` |