Environment:Recommenders team Recommenders Spark Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Computing, Spark |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Apache Spark 3.3+ environment with PySpark, Java (Temurin JDK 21), and PyArrow for distributed recommendation workflows including ALS matrix factorization.
Description
This environment provides the distributed computing stack for Spark-based recommendation pipelines. PySpark is an optional dependency guarded by `try/except ImportError` blocks throughout the codebase, allowing the library to function in pure-Python mode when Spark is unavailable. The `start_or_get_spark` utility configures a SparkSession with 10 GB default driver memory, 4 MB JVM stack size, and optional MMLSpark (SynapseML) integration.
Usage
Use this environment for ALS Spark recommendation workflows, Spark-based evaluation (`SparkRankingEvaluation`, `SparkRatingEvaluation`), Spark data splitting, and the Spark benchmarking paths. This environment is required whenever the `recommenders[spark]` extra is installed.
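A minimal usage sketch for obtaining a session from this environment. The function below only wraps the documented `start_or_get_spark` defaults; the `als_demo` app name is an illustrative choice, and actually calling the function requires PySpark plus a working JDK:

```python
def get_session():
    """Return a configured SparkSession via the library's utility.

    Requires the `recommenders[spark]` extra and a JDK on the path.
    Defaults documented for this environment: 10 GB driver memory,
    `local[*]` master, and a 4 MB JVM stack size (-Xss4m).
    """
    from recommenders.utils.spark_utils import start_or_get_spark

    return start_or_get_spark(app_name="als_demo", memory="10g")
```

The import is kept inside the function so that modules using this helper still load in pure-Python mode, mirroring the library's own optional-import guard.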
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (primary), Windows, macOS | Windows requires extra environment variables |
| Java | JDK 21 (Temurin) | Docker and DevContainer install Temurin JDK 21 via `apt` |
| RAM | >= 10 GB | Default Spark driver memory is 10 GB |
| CPUs | >= 8 | CI tests run on Azure Standard_A8m_v2 (8 vCPUs, 64 GiB) |
Dependencies
Spark Python Packages
- `pyspark` >= 3.3.0, < 4
- `pyarrow` >= 10.0.1
System Packages
- Java Development Kit (JDK 21 recommended, Temurin distribution)
- Apache Spark runtime (bundled with PySpark or standalone)
Optional: MMLSpark / SynapseML
- `com.microsoft.azure:synapseml_2.12:0.9.5` (Maven package)
- Repository: `https://mmlspark.azureedge.net/maven`
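When MMLSpark/SynapseML is needed, the package and repository constants above can be handed to `start_or_get_spark`, which composes `PYSPARK_SUBMIT_ARGS` accordingly. A sketch; the `packages` and `repository` keyword names are assumptions about the utility's signature, not verified here:

```python
def get_mmlspark_session():
    """Sketch: start a session with the SynapseML Maven package attached.

    The `packages`/`repository` keyword names are assumed; check the
    `start_or_get_spark` signature in `recommenders/utils/spark_utils.py`.
    """
    from recommenders.utils.spark_utils import (
        MMLSPARK_PACKAGE,
        MMLSPARK_REPO,
        start_or_get_spark,
    )

    # Maven coordinates are resolved at session start via spark-submit args
    return start_or_get_spark(
        app_name="mmlspark_demo",
        packages=[MMLSPARK_PACKAGE],
        repository=MMLSPARK_REPO,
    )
```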
Credentials
The following environment variables must be set depending on the platform:
All platforms:
- `PYSPARK_SUBMIT_ARGS`: Auto-set by `start_or_get_spark()` when packages/jars are specified
Windows (required):
- `JAVA_HOME`: Path to JDK installation
- `SPARK_HOME`: Path to Spark installation
- `HADOOP_HOME`: Path to Hadoop winutils
- `PYSPARK_PYTHON`: Path to Python executable
- `PYSPARK_DRIVER_PYTHON`: Path to Python executable
macOS (required):
- `PYSPARK_PYTHON`: Path to Python executable
- `PYSPARK_DRIVER_PYTHON`: Path to Python executable
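On macOS (and in scripts generally), the two interpreter variables can be set from Python before any Spark session is created, the same way the CI harness does. A minimal sketch using only the standard library:

```python
import os
import sys

# Point both the Spark workers and the driver at the running interpreter,
# so PySpark does not pick up a mismatched system Python.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```

This must run before the first `SparkSession` is constructed; changing the variables afterwards has no effect on an already-running JVM.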
CI test environment:
- `SPARK_HOME`: Must be unset in functional tests to avoid conflicts (removed via `os.environ.pop`)
Quick Install
# Install with Spark extras
pip install "recommenders[spark]"
# Or install Java separately (Ubuntu/Debian; requires the Adoptium apt repository)
sudo apt-get install -y temurin-21-jdk
# DevContainer auto-installs via postCreateCommand:
# pip install -e .[dev,spark]
Code Evidence
PySpark optional import guard from `recommenders/utils/spark_utils.py:7-10`:
try:
from pyspark.sql import SparkSession # noqa: F401
except ImportError:
pass # skip this import if we are in pure python environment
Default Spark configuration from `recommenders/utils/spark_utils.py:19-22`:
def start_or_get_spark(
app_name="Sample",
url="local[*]",
memory="10g",
JVM stack size configuration from `recommenders/utils/spark_utils.py:68-70`:
# Set larger stack size
spark_opts.append('config("spark.executor.extraJavaOptions", "-Xss4m")')
spark_opts.append('config("spark.driver.extraJavaOptions", "-Xss4m")')
MMLSpark package definition from `recommenders/utils/spark_utils.py:12-16`:
MMLSPARK_PACKAGE = "com.microsoft.azure:synapseml_2.12:0.9.5"
MMLSPARK_REPO = "https://mmlspark.azureedge.net/maven"
# We support Spark v3, but in case you wish to use v2, set
# MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
CI test Spark environment setup from `tests/functional/examples/test_notebooks_pyspark.py:71-73`:
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
os.environ.pop("SPARK_HOME", None)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'pyspark'` | PySpark not installed | `pip install "recommenders[spark]"` |
| `JAVA_HOME is not set` | Java not installed or JAVA_HOME not configured | Install JDK 21 and set JAVA_HOME |
| Spark test flaky failures | ALS PySpark tests are known to be flaky | Tests use `@pytest.mark.flaky(reruns=5, reruns_delay=2)` |
| `StackOverflowError` in Spark | Default JVM stack size too small | Auto-handled by `start_or_get_spark()` which sets `-Xss4m` |
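The flaky-test mitigation in the table is the standard `pytest-rerunfailures` marker. A sketch of how such a test is annotated; the test name and body here are illustrative, not taken from the repository:

```python
import pytest


@pytest.mark.flaky(reruns=5, reruns_delay=2)
def test_als_pyspark_smoke():
    # Illustrative body: a real test would train an ALS model and check
    # metrics; the marker reruns the test up to 5 times, waiting 2 s
    # between attempts, to absorb transient Spark failures.
    assert True
```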
Compatibility Notes
- Spark v3 only: The codebase targets Spark 3.x. Spark 2.x requires changing the MMLSpark package to `com.microsoft.ml.spark:mmlspark_2.11:0.18.1`.
- Databricks: Tested on Databricks Runtime 12.2-15.4 LTS (Spark 3.3.2-3.5.0). Requires manual installation of `numpy<2.0.0`, `pandera<=0.18.3`, `scipy<=1.13.1`.
- Windows: Requires five environment variables (JAVA_HOME, SPARK_HOME, HADOOP_HOME, PYSPARK_PYTHON, PYSPARK_DRIVER_PYTHON).
- macOS: Requires PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to be set.
- CI: Spark tests run on Azure Standard_A8m_v2 (8 vCPUs, 64 GiB memory). PySpark functional tests use `@pytest.mark.skipif(sys.platform == "win32")` to skip on Windows.
Related Pages
- Implementation:Recommenders_team_Recommenders_Start_Or_Get_Spark
- Implementation:Recommenders_team_Recommenders_Load_Spark_Df
- Implementation:Recommenders_team_Recommenders_Spark_Random_Split
- Implementation:Recommenders_team_Recommenders_PySpark_ALS
- Implementation:Recommenders_team_Recommenders_ALSModel_Transform
- Implementation:Recommenders_team_Recommenders_Spark_Evaluation_Classes