
Environment:Apache Spark Python Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, PySpark
Last Updated: 2026-02-08 22:00 GMT

Overview

Python 3.10+ environment with PySpark dependencies including PyArrow >= 18.0.0, pandas >= 2.2.0, and NumPy >= 1.21 for running PySpark applications and tests.

Description

This environment provides the Python runtime and library dependencies needed for PySpark development, testing, and execution. Python 3.10 is the minimum supported version, with support extending through Python 3.14 and PyPy. The environment includes core data processing libraries (PyArrow, pandas, NumPy) and gRPC dependencies for Spark Connect. PySpark requires a valid JDK 17+ installation accessible via JAVA_HOME, as it bridges Python to the JVM.
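The two prerequisites above (Python 3.10+ and a `JAVA_HOME` pointing at a JDK) can be verified before launching anything. Below is a minimal stdlib-only sketch; the function name `check_prerequisites` and the message wording are illustrative, not part of PySpark's API:

```python
import os
import sys

def check_prerequisites(version_info=sys.version_info, env=os.environ):
    """Return a list of problems with the local PySpark prerequisites."""
    problems = []
    if version_info < (3, 10):
        problems.append("Python versions prior to 3.10 are not supported.")
    if not env.get("JAVA_HOME"):
        problems.append("JAVA_HOME is not set; install a JDK 17+ and export it.")
    return problems

# Example: a Python 3.9 interpreter with no JAVA_HOME reports both problems.
print(check_prerequisites((3, 9, 0), {}))
```

Note that this only checks that `JAVA_HOME` is set, not that it points at a JDK of the right version; PySpark itself will fail at startup if the JVM cannot be launched.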

Usage

Use this environment for any PySpark development, the Python_Run_Tests workflow, and when running PySpark applications. It is required whenever Python code interacts with the Spark runtime, including Spark Connect client applications.

System Requirements

Category | Requirement | Notes
OS | Linux, macOS, or Windows (via WSL2) | Cross-platform Python support
Hardware | Standard workstation | No GPU required for base PySpark
Runtime | Python 3.10+ | Supports 3.10 through 3.14, on both CPython and PyPy
JVM | JDK 17+ with JAVA_HOME set | PySpark bridges to the JVM via Py4J

Dependencies

System Packages

  • Python 3.10 or newer (3.11 is the default for dev testing)
  • JDK 17+ (with `JAVA_HOME` set)

Python Packages (Core)

  • `py4j` >= 0.10.9.7 (JVM bridge)
  • `pyarrow` >= 18.0.0 (columnar data exchange)
  • `pandas` >= 2.2.0 (DataFrame operations)
  • `numpy` >= 1.21 (numerical computing)
  • `grpcio` >= 1.76.0 (Spark Connect)
  • `googleapis-common-protos` >= 1.71.0 (Spark Connect)
  • `pyyaml` >= 3.11 (configuration)
  • `zstandard` >= 0.25.0 (compression)
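The version floors above can be checked against the installed environment with `importlib.metadata`. The sketch below is a best-effort helper, not something PySpark ships: the `FLOORS` dict mirrors a subset of the list above, and the naive numeric parse ignores pre-release suffixes:

```python
from importlib import metadata

# A subset of the version floors listed above (extend as needed).
FLOORS = {"pyarrow": "18.0.0", "pandas": "2.2.0", "numpy": "1.21"}

def parse(version):
    """Best-effort numeric parse; a non-numeric part ends the key."""
    parts = []
    for p in version.split("."):
        if not p.isdigit():
            break
        parts.append(int(p))
    return tuple(parts)

def meets_floor(installed, required):
    return parse(installed) >= parse(required)

def missing_or_outdated(floors=FLOORS):
    """Names of packages that are absent or below their floor."""
    bad = []
    for pkg, floor in floors.items():
        try:
            if not meets_floor(metadata.version(pkg), floor):
                bad.append(pkg)
        except metadata.PackageNotFoundError:
            bad.append(pkg)
    return bad
```

For production checks, prefer the `packaging.version` module, which handles pre-release and post-release tags correctly.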

Python Packages (Development/Testing)

See `dev/requirements.txt` for the full list of development dependencies.

Credentials

No credentials required for the base PySpark environment. Credentials may be needed for specific data sources (JDBC, cloud storage).

Quick Install

# Install PySpark with the Spark Connect extras (quote to protect the brackets from the shell)
pip install "pyspark[connect]"

# Or install from source (after building Spark):
cd python && pip install -e .

# Verify installation
python3 -c "import pyspark; print(pyspark.__version__)"

Code Evidence

Python version check from `python/run-tests:24-28`:

PYTHON_VERSION_CHECK=$(python3 -c 'import sys; print(sys.version_info < (3, 10, 0))')
if [[ "$PYTHON_VERSION_CHECK" == "True" ]]; then
  echo "Python versions prior to 3.10 are not supported."
  exit -1
fi

Minimum Python requirement from `python/packaging/classic/setup.py:384`:

python_requires=">=3.10"

Core dependencies with versions from `python/packaging/classic/setup.py:153-159`:

# Minimum version requirements for PySpark dependencies
pandas >= 2.2.0
numpy >= 1.21
pyarrow >= 18.0.0
grpcio >= 1.76.0
googleapis-common-protos >= 1.71.0
pyyaml >= 3.11
zstandard >= 0.25.0

Python version mismatch detection from `python/pyspark/worker_util.py:82-85`:

# Checks that driver and worker Python versions match
# Raises error if sys.version_info doesn't match expected version

Environment variable configuration from `python/pyspark/core/context.py:343-344`:

self.pythonExec = os.environ.get("PYSPARK_PYTHON", "python3")
self.pythonVer = "%d.%d" % sys.version_info[:2]
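As the snippet shows, the driver records its version as a "major.minor" string; the worker-side check compares the same shape. A stdlib sketch of that comparison, with an illustrative error message (the exact wording in PySpark differs):

```python
def version_string(info):
    """Driver-side version tag, mirroring context.py's pythonVer format."""
    return "%d.%d" % (info[0], info[1])

def check_match(driver_ver, worker_info):
    """Raise if the worker's major.minor differs from the driver's."""
    worker_ver = version_string(worker_info)
    if driver_ver != worker_ver:
        raise RuntimeError(
            f"Python in worker has different version {worker_ver} "
            f"than that in driver {driver_ver}"
        )

# A 3.11 driver paired with a 3.11.4 worker passes; 3.10 vs 3.11 raises.
check_match("3.11", (3, 11, 4))
```

Because only major.minor is compared, patch-level differences (3.11.4 vs 3.11.9) are tolerated, but minor-version skew across the cluster is not.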

Common Errors

Error Message | Cause | Solution
`Python versions prior to 3.10 are not supported` | Python < 3.10 detected | Upgrade to Python 3.10+
`JAVA_HOME is not set` | JDK not found | Install JDK 17+ and `export JAVA_HOME=/path/to/jdk`
Python version mismatch between driver and worker | Different Python versions on driver vs executors | Set `PYSPARK_PYTHON` to the same path on all nodes
`ImportError: No module named py4j` | py4j not installed | `pip install py4j` or install PySpark via pip
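The `ImportError` row can be diagnosed without triggering the import itself, using `importlib.util.find_spec`. The helper name below is hypothetical:

```python
from importlib.util import find_spec

def diagnose_import(module="py4j"):
    """Report whether a module PySpark needs is importable."""
    if find_spec(module) is None:
        return f"{module} not found: pip install {module} (or install PySpark via pip)"
    return "ok"
```

This is handy in cluster bootstrap scripts, where a clear message before job submission beats a traceback from a remote executor.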

Compatibility Notes

  • PyPy: Supported alongside CPython. Performance characteristics differ.
  • Python 3.14: Listed as a supported version in setup.py classifiers.
  • PYSPARK_PYTHON: Environment variable controls which Python executable executors use. Must be consistent across all cluster nodes.
  • PYSPARK_DRIVER_PYTHON: Separate variable for the driver Python executable, allowing different versions (e.g., Jupyter on driver).
  • PYTHONHASHSEED: Set to 0 by `bin/spark-submit` to disable randomized hash for reproducibility.
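The effect of the `PYTHONHASHSEED` note can be demonstrated directly: string hashes are randomized per interpreter process unless the seed is pinned. A small sketch that spawns fresh interpreters to compare (the helper name is illustrative):

```python
import os
import subprocess
import sys

def string_hash(word, seed):
    """Hash a string in a fresh interpreter with PYTHONHASHSEED set."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({word!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# With PYTHONHASHSEED=0 the hash is stable across interpreter runs,
# which is why spark-submit pins it: operations that partition by
# Python hash() stay reproducible across driver and executors.
assert string_hash("spark", "0") == string_hash("spark", "0")
```

With `PYTHONHASHSEED=random` (the CPython default), two runs will almost always print different values, which would scatter identical keys to different partitions on different executors.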
