Principle:Apache Spark Container Image Build
| Metadata | Value |
|---|---|
| Domains | Kubernetes, Containerization |
| Type | Principle |
| Related | Implementation:Apache_Spark_Docker_Image_Tool |
Overview
A container image build process that packages a Spark distribution with language runtimes into Docker images suitable for Kubernetes pod execution.
Description
Running Spark on Kubernetes requires container images containing the Spark distribution, Java runtime, and optionally Python or R runtimes. The image build process follows a layered construction approach that creates a base Spark image and then derives language-specific images from it.
The build process involves the following stages:
- Base image construction -- The base Spark image includes the Spark JARs, shell scripts (`bin/`, `sbin/`), example JARs, the Kubernetes-specific entrypoint script, and the `tini` init process. The base Dockerfile is located at `kubernetes/dockerfiles/spark/Dockerfile`.
- Language binding derivation -- PySpark and SparkR images are built as derived images that use the base Spark image as their parent (`--build-arg base_img=spark:<tag>`). The PySpark image adds the Python runtime and PySpark libraries; the SparkR image adds R and SparkR packages.
- Cross-platform builds -- For heterogeneous clusters containing both amd64 and arm64 nodes, cross-platform builds using `docker buildx` produce multi-architecture manifests that allow Kubernetes to pull the correct image for each node's architecture.
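The stages above map onto Spark's bundled `docker-image-tool.sh`. A sketch of a typical build-and-push sequence, assuming it is run from an unpacked Spark distribution (the registry name and tag are placeholders):

```shell
# Build the base Spark image plus the PySpark and SparkR derivatives.
# -r sets the repository prefix, -t the image tag; -p and -R select the
# binding Dockerfiles that derive from the base image.
./bin/docker-image-tool.sh -r registry.example.com/myorg -t v3.5.1 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
  -R ./kubernetes/dockerfiles/spark/bindings/R/Dockerfile \
  build

# Push all built images (spark, spark-py, spark-r) to the registry.
./bin/docker-image-tool.sh -r registry.example.com/myorg -t v3.5.1 push
```

Omitting `-p` or `-R` skips the corresponding derived image, so clusters that only run JVM workloads can build just the base image.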
The base image sets a default UID of 185 for the Spark process user, following the principle of least privilege: containers run as a non-root user by default. The UID can be overridden at build time for deployments with stricter security requirements.
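The default UID can be replaced when invoking the image build; `docker-image-tool.sh` exposes a `-u` option for the `USER` directive (the UID value and registry name here are illustrative):

```shell
# Build images whose Spark process runs as UID 1000 instead of the default 185.
./bin/docker-image-tool.sh -r registry.example.com/myorg -t v3.5.1 -u 1000 build
```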
Usage
Use this pattern when preparing to deploy Spark on Kubernetes:
- Initial setup -- Build images once and push to a container registry accessible by the Kubernetes cluster.
- Version upgrades -- Rebuild images when upgrading Spark versions or changing language runtime versions.
- Custom dependencies -- Extend the base Dockerfile to include additional JARs or libraries required by your Spark applications.
- Development workflows -- Use Minikube's Docker daemon directly to avoid pushing images to a remote registry during development.
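For the custom-dependencies case, one common approach is a small derived Dockerfile that layers extra JARs onto the already-built base image (the tag and JAR name below are placeholders):

```dockerfile
# Hypothetical derived image adding an application dependency JAR.
FROM spark:v3.5.1
# JARs placed in /opt/spark/jars are on the driver and executor classpath.
COPY my-connector.jar /opt/spark/jars/
```

For the development workflow, `docker-image-tool.sh` also offers an `-m` flag that builds directly against Minikube's Docker daemon, so no registry push is needed.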
Theoretical Basis
The layered image construction follows a derivation pattern:
base_image(jars, scripts, entrypoint)
-> derive(python_image)
-> derive(r_image)
Cross-platform builds extend this with architecture multiplexing:
buildx(platforms=[amd64, arm64])
-> build_per_platform(base_image)
-> create_manifest_list
-> push
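The multiplexing above corresponds roughly to a `docker buildx` invocation along these lines (registry and tag are placeholders; recent `docker-image-tool.sh` versions wrap this behind an `-X` option):

```shell
# One-time: create and select a buildx builder that can target both architectures.
docker buildx create --name spark-builder --use

# Build for amd64 and arm64, assemble a manifest list, and push in one step.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t registry.example.com/myorg/spark:v3.5.1 \
  -f kubernetes/dockerfiles/spark/Dockerfile \
  --push .
```

Multi-architecture builds must be pushed rather than loaded into the local daemon, because a manifest list referencing several per-platform images lives in a registry.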
For development (non-release) builds, a temporary build context is created to avoid uploading the entire source tree to the Docker daemon, which can be very large due to test logs and build artifacts.
Image Architecture
| Image | Base | Added Content | Dockerfile |
|---|---|---|---|
| `spark:<tag>` | JDK base | Spark JARs, `bin/`, `sbin/`, `entrypoint.sh` | `kubernetes/dockerfiles/spark/Dockerfile` |
| `spark-py:<tag>` | `spark:<tag>` | Python runtime, PySpark libraries | `kubernetes/dockerfiles/spark/bindings/python/Dockerfile` |
| `spark-r:<tag>` | `spark:<tag>` | R runtime, SparkR packages | `kubernetes/dockerfiles/spark/bindings/R/Dockerfile` |