# Environment: Apache Spark Python Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, PySpark |
| Last Updated | 2026-02-08 22:00 GMT |
## Overview
Python 3.10+ environment with PySpark dependencies including PyArrow >= 18.0.0, pandas >= 2.2.0, and NumPy >= 1.21 for running PySpark applications and tests.
## Description
This environment provides the Python runtime and library dependencies needed for PySpark development, testing, and execution. Python 3.10 is the minimum supported version, with support extending through Python 3.14 and PyPy. The environment includes core data processing libraries (PyArrow, pandas, NumPy) and gRPC dependencies for Spark Connect. PySpark requires a valid JDK 17+ installation accessible via JAVA_HOME, as it bridges Python to the JVM.
## Usage
Use this environment for any PySpark development, the Python_Run_Tests workflow, and when running PySpark applications. It is required whenever Python code interacts with the Spark runtime, including Spark Connect client applications.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux, macOS, or Windows (via WSL2) | Cross-platform Python support |
| Hardware | Standard workstation | No GPU required for base PySpark |
| Runtime | Python 3.10+ | Supports 3.10 through 3.14, on CPython and PyPy |
| JVM | JDK 17+ with JAVA_HOME set | PySpark bridges to JVM via Py4J |
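The runtime and JVM requirements above can be checked programmatically before launching anything. A minimal sketch using only the standard library (the function name `preflight_check` is illustrative, not part of PySpark):

```python
import sys

def preflight_check(env):
    """Return a list of environment problems; empty when the basics look OK."""
    problems = []
    # Python 3.10 is the minimum supported version
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ required, found %d.%d" % sys.version_info[:2])
    # PySpark needs a JDK reachable via JAVA_HOME
    if not env.get("JAVA_HOME"):
        problems.append("JAVA_HOME is not set (JDK 17+ required)")
    return problems

print(preflight_check({}))
```

On a correctly configured machine, `preflight_check(os.environ)` returns an empty list.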
## Dependencies

### System Packages

- Python 3.10 or newer (3.11 is the default for dev testing)
- JDK 17+ (with `JAVA_HOME` set)

### Python Packages (Core)
- `py4j` >= 0.10.9.7 (JVM bridge)
- `pyarrow` >= 18.0.0 (columnar data exchange)
- `pandas` >= 2.2.0 (DataFrame operations)
- `numpy` >= 1.21 (numerical computing)
- `grpcio` >= 1.76.0 (Spark Connect)
- `googleapis-common-protos` >= 1.71.0 (Spark Connect)
- `pyyaml` >= 3.11 (configuration)
- `zstandard` >= 0.25.0 (compression)
### Python Packages (Development/Testing)
See `dev/requirements.txt` for the full list of development dependencies.
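The minimum versions listed above can be verified against what pip actually installed. A standard-library-only sketch; `MINIMUM_VERSIONS` is a hand-copied subset of the list above, and `parse_version` assumes purely numeric release segments (no `rc`/`dev` suffixes):

```python
from importlib.metadata import PackageNotFoundError, version

# Hand-copied subset of the minimum versions listed above
MINIMUM_VERSIONS = {
    "pyarrow": "18.0.0",
    "pandas": "2.2.0",
    "numpy": "1.21",
    "grpcio": "1.76.0",
}

def parse_version(v):
    # "18.0.0" -> (18, 0, 0); assumes purely numeric release segments
    return tuple(int(part) for part in v.split("."))

def check_minimums(minimums):
    """Map each package to True (new enough), False (too old), or None (missing)."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = parse_version(version(name)) >= parse_version(minimum)
        except PackageNotFoundError:
            report[name] = None
    return report
```

For real dependency resolution, `pip check` or `packaging.version.Version` are more robust than this tuple comparison.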
## Credentials
No credentials required for the base PySpark environment. Credentials may be needed for specific data sources (JDBC, cloud storage).
## Quick Install

```bash
# Install PySpark with all dependencies (quoted so the extra doesn't trip shell globbing)
pip install "pyspark[connect]"

# Or install from source (after building Spark):
cd python && pip install -e .

# Verify installation
python3 -c "import pyspark; print(pyspark.__version__)"
```
## Code Evidence

Python version check from `python/run-tests:24-28`:

```bash
PYTHON_VERSION_CHECK=$(python3 -c 'import sys; print(sys.version_info < (3, 10, 0))')
if [[ "$PYTHON_VERSION_CHECK" == "True" ]]; then
    echo "Python versions prior to 3.10 are not supported."
    exit -1
fi
```
Minimum Python requirement from `python/packaging/classic/setup.py:384`:

```python
python_requires=">=3.10"
```
Core dependencies with versions from `python/packaging/classic/setup.py:153-159`:

```
# Minimum version requirements for PySpark dependencies
pandas >= 2.2.0
numpy >= 1.21
pyarrow >= 18.0.0
grpcio >= 1.76.0
googleapis-common-protos >= 1.71.0
pyyaml >= 3.11
zstandard >= 0.25.0
```
Python version mismatch detection from `python/pyspark/worker_util.py:82-85`:

```python
# Checks that driver and worker Python versions match
# Raises error if sys.version_info doesn't match expected version
```
Environment variable configuration from `python/pyspark/core/context.py:343-344`:

```python
self.pythonExec = os.environ.get("PYSPARK_PYTHON", "python3")
self.pythonVer = "%d.%d" % sys.version_info[:2]
```
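The `pythonVer` string above is just a major.minor tag, which the driver later compares against each worker's reported version. A small sketch of that derivation (`version_tag` is an illustrative name, not a PySpark function):

```python
import sys

def version_tag(version_info):
    # Major.minor string, e.g. (3, 11, 4) -> "3.11"; patch level is ignored,
    # so 3.11.4 on the driver and 3.11.9 on a worker are treated as matching
    return "%d.%d" % version_info[:2]

print(version_tag(sys.version_info))
```

This is why mixing 3.10 and 3.11 across driver and executors raises an error, while differing patch releases do not.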
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Python versions prior to 3.10 are not supported` | Python < 3.10 detected | Upgrade to Python 3.10+ |
| `JAVA_HOME is not set` | JDK not found | Install JDK 17+ and `export JAVA_HOME=/path/to/jdk` |
| Python version mismatch between driver and worker | Different Python versions on driver vs executors | Set `PYSPARK_PYTHON` to same path on all nodes |
| `ImportError: No module named py4j` | py4j not installed | `pip install py4j` or install PySpark via pip |
## Compatibility Notes
- PyPy: Supported alongside CPython. Performance characteristics differ.
- Python 3.14: Listed as a supported version in setup.py classifiers.
- PYSPARK_PYTHON: Environment variable controls which Python executable executors use. Must be consistent across all cluster nodes.
- PYSPARK_DRIVER_PYTHON: Separate variable for the driver Python executable, allowing different versions (e.g., Jupyter on driver).
- PYTHONHASHSEED: Set to 0 by `bin/spark-submit` to disable randomized hash for reproducibility.
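The three variables above are usually exported before invoking `spark-submit`. A sketch of setting them from Python; the interpreter paths are hypothetical and must be replaced with paths valid on every node:

```python
import os

# Hypothetical interpreter paths; both must resolve to the same Python
# version on the driver and on every executor node
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.11"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.11"

# bin/spark-submit sets this itself; shown here for completeness
os.environ["PYTHONHASHSEED"] = "0"

print(os.environ["PYSPARK_PYTHON"])
```

Setting these in the launching process is equivalent to exporting them in the shell, since `spark-submit` inherits the parent environment.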