
Environment: Recommenders Spark Environment

From Leeroopedia


Domains: Infrastructure, Distributed_Computing, Spark
Last Updated: 2026-02-10 00:00 GMT

Overview

Apache Spark 3.3+ environment with PySpark, Java (Temurin JDK 21), and PyArrow for distributed recommendation workflows including ALS matrix factorization.

Description

This environment provides the distributed computing stack for Spark-based recommendation pipelines. PySpark is an optional dependency guarded by `try/except ImportError` blocks throughout the codebase, allowing the library to function in pure-Python mode when Spark is unavailable. The `start_or_get_spark` utility configures a SparkSession with 10 GB default driver memory, 4 MB JVM stack size, and optional MMLSpark (SynapseML) integration.
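The optional-import pattern described above can be sketched as follows. The `HAS_SPARK` flag and the `require_spark` helper are illustrative names, not part of the library's API:

```python
# Optional-dependency guard: probe for PySpark once at import time.
# HAS_SPARK is an illustrative flag name, not part of recommenders itself.
try:
    from pyspark.sql import SparkSession  # noqa: F401
    HAS_SPARK = True
except ImportError:
    SparkSession = None
    HAS_SPARK = False


def require_spark():
    """Raise a helpful error when a Spark-only code path is entered."""
    if not HAS_SPARK:
        raise ImportError(
            'PySpark is not installed; run pip install "recommenders[spark]"'
        )
```

This lets pure-Python callers import the module freely, while Spark-dependent entry points fail with an actionable message instead of a bare `NameError`.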

Usage

Use this environment for ALS Spark recommendation workflows, Spark-based evaluation (`SparkRankingEvaluation`, `SparkRatingEvaluation`), Spark data splitting, and the Spark benchmarking paths. It is required whenever the `recommenders[spark]` extra is installed.

System Requirements

  • OS: Linux (primary), Windows, macOS. Windows requires extra environment variables.
  • Java: JDK 21 (Temurin). Docker and DevContainer install Temurin JDK 21 via `apt`.
  • RAM: >= 10 GB. Default Spark driver memory is 10 GB.
  • CPUs: >= 8. CI tests run on Azure Standard_A8m_v2 (8 vCPUs, 64 GiB).
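A quick local sanity check against these requirements can be scripted with the standard library. Note that `shutil.which` only confirms a `java` launcher is on `PATH`, not its version; the function name below is an assumption for illustration:

```python
import os
import shutil


def check_spark_prereqs():
    """Report whether the basic Spark prerequisites look satisfied locally.

    Illustrative helper, not part of the recommenders library.
    """
    return {
        # Is a Java launcher visible on PATH?
        "java_on_path": shutil.which("java") is not None,
        # Is JAVA_HOME configured (required on Windows)?
        "java_home_set": bool(os.environ.get("JAVA_HOME")),
        # Rough check against the >= 8 vCPU CI baseline.
        "cpus_ok": (os.cpu_count() or 0) >= 8,
    }
```

Running this before launching a SparkSession surfaces missing Java or undersized hardware early, instead of inside an opaque JVM startup failure.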

Dependencies

Spark Python Packages

  • `pyspark` >= 3.3.0, < 4
  • `pyarrow` >= 10.0.1

System Packages

  • Java Development Kit (JDK 21 recommended, Temurin distribution)
  • Apache Spark runtime (bundled with PySpark or standalone)

Optional: MMLSpark / SynapseML

Credentials

The following environment variables must be set depending on the platform:

All platforms:

  • `PYSPARK_SUBMIT_ARGS`: Auto-set by `start_or_get_spark()` when packages/jars are specified
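As an illustration of what `start_or_get_spark()` does when packages or jars are supplied, the submit-args string can be composed roughly like this. The helper name and exact formatting are assumptions, not the library's code; only the trailing `pyspark-shell` marker is a genuine spark-submit convention:

```python
def build_submit_args(packages=None, jars=None, repositories=None):
    """Compose a PYSPARK_SUBMIT_ARGS value from optional package/jar lists.

    Illustrative only: the real composition lives in start_or_get_spark().
    """
    parts = []
    if packages:
        parts += ["--packages", ",".join(packages)]
    if jars:
        parts += ["--jars", ",".join(jars)]
    if repositories:
        parts += ["--repositories", ",".join(repositories)]
    # PYSPARK_SUBMIT_ARGS must end with this marker for the PySpark shell.
    parts.append("pyspark-shell")
    return " ".join(parts)


args = build_submit_args(
    packages=["com.microsoft.azure:synapseml_2.12:0.9.5"],
    repositories=["https://mmlspark.azureedge.net/maven"],
)
# One would then set: os.environ["PYSPARK_SUBMIT_ARGS"] = args
```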

Windows (required):

  • `JAVA_HOME`: Path to JDK installation
  • `SPARK_HOME`: Path to Spark installation
  • `HADOOP_HOME`: Path to Hadoop winutils
  • `PYSPARK_PYTHON`: Path to Python executable
  • `PYSPARK_DRIVER_PYTHON`: Path to Python executable

macOS (required):

  • `PYSPARK_PYTHON`: Path to Python executable
  • `PYSPARK_DRIVER_PYTHON`: Path to Python executable
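On Windows and macOS, the two interpreter variables can simply be pointed at the current Python, mirroring what the CI setup shown later does. A minimal sketch:

```python
import os
import sys

# Point both driver and worker Python at the interpreter running this script,
# so Spark workers do not resolve a different (possibly incompatible) Python.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```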

CI test environment:

  • `SPARK_HOME`: Must be unset in functional tests to avoid conflicts (removed via `os.environ.pop`)

Quick Install

# Install with Spark extras
pip install "recommenders[spark]"

# Or install Java separately (Ubuntu/Debian)
sudo apt-get install -y temurin-21-jdk

# DevContainer auto-installs via postCreateCommand:
# pip install -e .[dev,spark]

Code Evidence

PySpark optional import guard from `recommenders/utils/spark_utils.py:7-10`:

try:
    from pyspark.sql import SparkSession  # noqa: F401
except ImportError:
    pass  # skip this import if we are in pure python environment

Default Spark configuration from `recommenders/utils/spark_utils.py:19-22`:

def start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",

JVM stack size configuration from `recommenders/utils/spark_utils.py:68-70`:

# Set larger stack size
spark_opts.append('config("spark.executor.extraJavaOptions", "-Xss4m")')
spark_opts.append('config("spark.driver.extraJavaOptions", "-Xss4m")')

MMLSpark package definition from `recommenders/utils/spark_utils.py:12-16`:

MMLSPARK_PACKAGE = "com.microsoft.azure:synapseml_2.12:0.9.5"
MMLSPARK_REPO = "https://mmlspark.azureedge.net/maven"
# We support Spark v3, but in case you wish to use v2, set
# MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"

CI test Spark environment setup from `tests/functional/examples/test_notebooks_pyspark.py:71-73`:

os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
os.environ.pop("SPARK_HOME", None)

Common Errors

  • `ImportError: pyspark not found`: PySpark is not installed. Fix: `pip install "recommenders[spark]"`.
  • `JAVA_HOME is not set`: Java is not installed or `JAVA_HOME` is not configured. Fix: install JDK 21 and set `JAVA_HOME`.
  • Flaky Spark test failures: ALS PySpark tests are known to be flaky. Tests use `@pytest.mark.flaky(reruns=5, reruns_delay=2)`.
  • `StackOverflowError` in Spark: the default JVM stack size is too small. Handled automatically by `start_or_get_spark()`, which sets `-Xss4m`.

Compatibility Notes

  • Spark v3 only: The codebase targets Spark 3.x. Spark 2.x requires changing the MMLSpark package to `com.microsoft.ml.spark:mmlspark_2.11:0.18.1`.
  • Databricks: Tested on Databricks Runtime 12.2-15.4 LTS (Spark 3.3.2-3.5.0). Requires manual installation of `numpy<2.0.0`, `pandera<=0.18.3`, `scipy<=1.13.1`.
  • Windows: Requires five environment variables (JAVA_HOME, SPARK_HOME, HADOOP_HOME, PYSPARK_PYTHON, PYSPARK_DRIVER_PYTHON).
  • macOS: Requires PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to be set.
  • CI: Spark tests run on Azure Standard_A8m_v2 (8 vCPUs, 64 GiB memory). PySpark functional tests use `@pytest.mark.skipif(sys.platform == "win32")` to skip on Windows.
