

Heuristic:Apache Spark K8s Container Patterns

From Leeroopedia




Knowledge Sources
Domains Kubernetes, Container_Operations, Debugging
Last Updated 2026-02-08 22:00 GMT

Overview

Kubernetes container patterns for Spark: create passwd entries on the fly to handle anonymous UIDs, use SIGPWR to trigger graceful decommissioning, add the current working directory to the executor classpath (SPARK-43540), and run under tini for signal forwarding and zombie reaping.

Description

Running Spark on Kubernetes introduces several container-specific challenges that are addressed by patterns embedded in the entrypoint and decommission scripts. Kubernetes (especially OpenShift) may run pods with arbitrary UIDs not present in /etc/passwd, which breaks Java's user lookup. The graceful decommissioning process uses the SIGPWR signal to trigger Spark's internal decommission logic, and the decom.sh script uses `tail --pid` as an elegant process-wait mechanism. The tini init system is used as PID 1 to properly handle zombie processes and signal forwarding.

Usage

Use these patterns when debugging Kubernetes pod failures, customizing Spark container images, or implementing graceful shutdown for long-running streaming applications on Kubernetes. Also apply when troubleshooting classpath issues in containerized environments.

The Insight (Rule of Thumb)

  • Anonymous UID Handling: Create a passwd entry dynamically if the container runs with an arbitrary UID. Check if /etc/passwd is writable; degrade gracefully if not.
  • Signal Handling: Use SIGPWR (not SIGTERM) to trigger graceful decommissioning of Spark executors. This allows data migration before shutdown.
  • Process Waiting: Use `tail --pid=<PID> -f /dev/null` to block until a process exits (more elegant than polling with sleep loops).
  • Classpath: Always include the current working directory in the executor classpath (SPARK-43540 fix) for dynamic resource loading.
  • PID 1: Always run the main process under tini (`exec /usr/bin/tini -s -- "${CMD[@]}"`) for proper signal forwarding and zombie process reaping.
  • Error Handling: Use `set +e` before operations that may fail in non-fatal ways (getent lookup, decommission commands), then restore `set -e` after.

Reasoning

Anonymous UIDs: Kubernetes security contexts can assign arbitrary UIDs to pods (e.g., OpenShift's restricted SCC). Without a passwd entry, Java's `System.getProperty("user.name")` and related calls fail, breaking Spark's internal logging and HDFS client operations. The on-the-fly passwd entry creation is a zero-cost workaround.
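The lookup-then-append remediation can be simulated against a throwaway passwd file rather than a live container. Everything here is illustrative (the temp file path, UID 4242, and the `/opt/spark` home); only the entry format mirrors the entrypoint's.

```shell
#!/usr/bin/env bash
# Simulate the anonymous-UID fix against a scratch passwd file.
passwd_file=$(mktemp)
echo "root:x:0:0:root:/root:/bin/sh" > "${passwd_file}"

# Return the username for a UID, empty if no entry exists.
lookup() { awk -F: -v uid="$1" '$3 == uid { print $1 }' "${passwd_file}"; }

if [ -z "$(lookup 4242)" ]; then
    # same shape of entry the entrypoint appends
    echo "4242:x:4242:0:anonymous uid:/opt/spark:/bin/false" >> "${passwd_file}"
fi
lookup 4242          # now resolves: prints "4242"
rm -f "${passwd_file}"
```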

SIGPWR for Decommissioning: SIGTERM triggers immediate shutdown, but Spark needs time to migrate cached data and in-flight shuffle blocks to other executors. SIGPWR was chosen because it is rarely used and carries no conflicting default meaning for the JVM, so Spark's internal CoarseGrainedExecutorBackend can safely register a handler that treats it as a decommission request.
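The opt-in idea can be illustrated with a toy shell process that traps SIGPWR as "decommission". This mirrors the concept only, not Spark's actual JVM signal handler, and it is Linux-only (SIGPWR does not exist on macOS/BSD).

```shell
#!/usr/bin/env bash
# Toy process that treats SIGPWR as a decommission request.
decommission() {
    echo "decommission requested: draining state before exit"
    exit 0
}
trap decommission PWR

echo "pid $$ waiting for SIGPWR"
kill -s PWR $$        # self-signal for the demo; decom.sh signals the JVM
sleep 5               # not reached: the trap fires first
```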

Tini as PID 1: In containers, the entrypoint process becomes PID 1. Without tini, orphaned child processes become zombies (never reaped), and signals may not be forwarded correctly to the Spark process. The `-s` flag tells tini to forward signals to the child process group.
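The `exec` in the entrypoint's final line matters for the same reason: the shell replaces itself with the workload instead of lingering as an intermediate parent, so tini's direct child is the real process. A minimal sketch with plain `exec` (tini itself omitted) shows the handoff:

```shell
#!/usr/bin/env bash
# The shell and the exec'd workload share one PID: no extra parent
# is left behind to strand signals.
echo "entrypoint shell pid: $$"
exec sh -c 'echo "workload pid: $$"'   # same PID: the shell was replaced
```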

SPARK-43540: Some Spark operations (particularly those involving dynamic class loading or resource file access) expect files relative to the current working directory to be resolvable via the classpath. Without PWD on the classpath, these operations silently fail in containers where the working directory may differ from the traditional Spark deployment layout.
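For images built before the SPARK-43540 fix landed, a similar effect can be approximated at submit time. The conf key below is a real Spark configuration; the command shape is illustrative and other arguments are elided.

```shell
# '.' resolves relative to the executor's working directory at runtime.
spark-submit \
  --conf spark.executor.extraClassPath='.' \
  ...
```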

Code Evidence

Anonymous UID handling from `entrypoint.sh:22-37`:

myuid=$(id -u)
mygid=$(id -g)
set +e
uidentry=$(getent passwd $myuid)
set -e

if [ -z "$uidentry" ] ; then
    if [ -w /etc/passwd ] ; then
        echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd
    else
        echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID"
    fi
fi

Graceful decommissioning via SIGPWR from `decom.sh:20-40`:

set +e
WORKER_PID=$(ps -o pid,cmd -C java | grep Executor | tail -n 1 | awk '{print $1}')
kill -s SIGPWR ${WORKER_PID}

# Wait for worker to exit
tail --pid=${WORKER_PID} -f /dev/null

# Ensure final log messages flush
sleep 1
date
echo "Done"
sleep 1
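The script above only results in data migration if decommissioning is enabled for the application. A typical submit-time configuration might look like the following; the conf keys are real Spark (3.1+) settings, while the command shape is illustrative and other arguments are elided.

```shell
spark-submit \
  --conf spark.decommission.enabled=true \
  --conf spark.storage.decommission.enabled=true \
  --conf spark.storage.decommission.shuffleBlocks.enabled=true \
  --conf spark.storage.decommission.rddBlocks.enabled=true \
  ...
```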

SPARK-43540 classpath fix from `entrypoint.sh:78-79`:

# SPARK-43540: add current working directory into executor classpath
SPARK_CLASSPATH="$SPARK_CLASSPATH:$PWD"

Tini process management from `entrypoint.sh:117-118`:

# Execute the container CMD under tini for better hygiene
exec /usr/bin/tini -s -- "${CMD[@]}"

Hadoop classpath delegation from `entrypoint.sh:62-66`:

# Does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations
if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then
  export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
fi
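The guard can be exercised in isolation: the default is computed only when the variable is unset or empty. The values below are illustrative stand-ins for `$HADOOP_HOME` and the output of `hadoop classpath`.

```shell
#!/usr/bin/env bash
# Only fill in a default when the user has not set one.
HADOOP_HOME=/opt/hadoop
SPARK_DIST_CLASSPATH=""            # empty, so eligible for the default
if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then
    SPARK_DIST_CLASSPATH="computed-default"   # stand-in for `hadoop classpath`
fi
echo "${SPARK_DIST_CLASSPATH}"     # prints: computed-default
```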
