# Environment: Apache Spark Python Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, PySpark |
| Last Updated | 2026-02-08 22:00 GMT |
## Overview
Python 3.10+ environment with PySpark dependencies including PyArrow >= 18.0.0, pandas >= 2.2.0, and NumPy >= 1.21 for running PySpark applications and tests.
## Description
This environment provides the Python runtime and library dependencies needed for PySpark development, testing, and execution. Python 3.10 is the minimum supported version, with support extending through Python 3.14 and PyPy. The environment includes core data processing libraries (PyArrow, pandas, NumPy) and gRPC dependencies for Spark Connect. PySpark requires a valid JDK 17+ installation accessible via JAVA_HOME, as it bridges Python to the JVM.
## Usage
Use this environment for any PySpark development, the Python_Run_Tests workflow, and when running PySpark applications. It is required whenever Python code interacts with the Spark runtime, including Spark Connect client applications.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux, macOS, or Windows (via WSL2) | Cross-platform Python support |
| Hardware | Standard workstation | No GPU required for base PySpark |
| Runtime | Python 3.10+ | Supports 3.10 through 3.14, on CPython and PyPy |
| JVM | JDK 17+ with JAVA_HOME set | PySpark bridges to JVM via Py4J |
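The runtime and JVM requirements above can be checked programmatically before launching anything. A minimal sketch using only the standard library (the function name `preflight_check` is illustrative, not part of PySpark):

```python
import sys

def preflight_check(env):
    """Return a list of environment problems; empty when the basics look OK."""
    problems = []
    # Python 3.10 is the minimum supported version
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ required, found %d.%d" % sys.version_info[:2])
    # PySpark needs a JDK reachable via JAVA_HOME
    if not env.get("JAVA_HOME"):
        problems.append("JAVA_HOME is not set (JDK 17+ required)")
    return problems

print(preflight_check({}))
```

On a correctly configured machine, `preflight_check(os.environ)` returns an empty list.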
## Dependencies

### System Packages

- Python 3.10 or newer (3.11 is the default for dev testing)
- JDK 17+ (with `JAVA_HOME` set)

### Python Packages (Core)
- `py4j` >= 0.10.9.7 (JVM bridge)
- `pyarrow` >= 18.0.0 (columnar data exchange)
- `pandas` >= 2.2.0 (DataFrame operations)
- `numpy` >= 1.21 (numerical computing)
- `grpcio` >= 1.76.0 (Spark Connect)
- `googleapis-common-protos` >= 1.71.0 (Spark Connect)
- `pyyaml` >= 3.11 (configuration)
- `zstandard` >= 0.25.0 (compression)
### Python Packages (Development/Testing)
See `dev/requirements.txt` for the full list of development dependencies.
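The minimum versions listed above can be verified against what pip actually installed. A standard-library-only sketch; `MINIMUM_VERSIONS` is a hand-copied subset of the list above, and `parse_version` assumes purely numeric release segments (no `rc`/`dev` suffixes):

```python
from importlib.metadata import PackageNotFoundError, version

# Hand-copied subset of the minimum versions listed above
MINIMUM_VERSIONS = {
    "pyarrow": "18.0.0",
    "pandas": "2.2.0",
    "numpy": "1.21",
    "grpcio": "1.76.0",
}

def parse_version(v):
    # "18.0.0" -> (18, 0, 0); assumes purely numeric release segments
    return tuple(int(part) for part in v.split("."))

def check_minimums(minimums):
    """Map each package to True (new enough), False (too old), or None (missing)."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = parse_version(version(name)) >= parse_version(minimum)
        except PackageNotFoundError:
            report[name] = None
    return report
```

For real dependency resolution, `pip check` or `packaging.version.Version` are more robust than this tuple comparison.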
## Credentials
No credentials required for the base PySpark environment. Credentials may be needed for specific data sources (JDBC, cloud storage).
## Quick Install

```bash
# Install PySpark with all dependencies (quoted so the extra doesn't trip shell globbing)
pip install "pyspark[connect]"

# Or install from source (after building Spark):
cd python && pip install -e .

# Verify installation
python3 -c "import pyspark; print(pyspark.__version__)"
```
## Code Evidence

Python version check from `python/run-tests:24-28`:

```bash
PYTHON_VERSION_CHECK=$(python3 -c 'import sys; print(sys.version_info < (3, 10, 0))')
if [[ "$PYTHON_VERSION_CHECK" == "True" ]]; then
    echo "Python versions prior to 3.10 are not supported."
    exit -1
fi
```
Minimum Python requirement from `python/packaging/classic/setup.py:384`:

```python
python_requires=">=3.10"
```
Core dependencies with versions from `python/packaging/classic/setup.py:153-159`:

```
# Minimum version requirements for PySpark dependencies
pandas >= 2.2.0
numpy >= 1.21
pyarrow >= 18.0.0
grpcio >= 1.76.0
googleapis-common-protos >= 1.71.0
pyyaml >= 3.11
zstandard >= 0.25.0
```
Python version mismatch detection from `python/pyspark/worker_util.py:82-85`:

```python
# Checks that driver and worker Python versions match
# Raises error if sys.version_info doesn't match expected version
```
Environment variable configuration from `python/pyspark/core/context.py:343-344`:

```python
self.pythonExec = os.environ.get("PYSPARK_PYTHON", "python3")
self.pythonVer = "%d.%d" % sys.version_info[:2]
```
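The `pythonVer` string above is just a major.minor tag, which the driver later compares against each worker's reported version. A small sketch of that derivation (`version_tag` is an illustrative name, not a PySpark function):

```python
import sys

def version_tag(version_info):
    # Major.minor string, e.g. (3, 11, 4) -> "3.11"; patch level is ignored,
    # so 3.11.4 on the driver and 3.11.9 on a worker are treated as matching
    return "%d.%d" % version_info[:2]

print(version_tag(sys.version_info))
```

This is why mixing 3.10 and 3.11 across driver and executors raises an error, while differing patch releases do not.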
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Python versions prior to 3.10 are not supported` | Python < 3.10 detected | Upgrade to Python 3.10+ |
| `JAVA_HOME is not set` | JDK not found | Install JDK 17+ and `export JAVA_HOME=/path/to/jdk` |
| Python version mismatch between driver and worker | Different Python versions on driver vs executors | Set `PYSPARK_PYTHON` to same path on all nodes |
| `ImportError: No module named py4j` | py4j not installed | `pip install py4j` or install PySpark via pip |
## Compatibility Notes
- PyPy: Supported alongside CPython. Performance characteristics differ.
- Python 3.14: Listed as a supported version in setup.py classifiers.
- PYSPARK_PYTHON: Environment variable controls which Python executable executors use. Must be consistent across all cluster nodes.
- PYSPARK_DRIVER_PYTHON: Separate variable for the driver Python executable, allowing different versions (e.g., Jupyter on driver).
- PYTHONHASHSEED: Set to 0 by `bin/spark-submit` to disable randomized hash for reproducibility.
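The three variables above are usually exported before invoking `spark-submit`. A sketch of setting them from Python; the interpreter paths are hypothetical and must be replaced with paths valid on every node:

```python
import os

# Hypothetical interpreter paths; both must resolve to the same Python
# version on the driver and on every executor node
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.11"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.11"

# bin/spark-submit sets this itself; shown here for completeness
os.environ["PYTHONHASHSEED"] = "0"

print(os.environ["PYSPARK_PYTHON"])
```

Setting these in the launching process is equivalent to exporting them in the shell, since `spark-submit` inherits the parent environment.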