
Environment: Recommenders Spark Environment

From Leeroopedia


Domains: Infrastructure, Distributed_Computing, Spark
Last Updated: 2026-02-10 00:00 GMT

Overview

Apache Spark 3.3+ environment with PySpark, Java (Temurin JDK 21), and PyArrow for distributed recommendation workflows including ALS matrix factorization.

Description

This environment provides the distributed computing stack for Spark-based recommendation pipelines. PySpark is an optional dependency guarded by `try/except ImportError` blocks throughout the codebase, allowing the library to function in pure-Python mode when Spark is unavailable. The `start_or_get_spark` utility configures a SparkSession with 10 GB default driver memory, 4 MB JVM stack size, and optional MMLSpark (SynapseML) integration.
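The optional-import pattern described above can be sketched as follows. The `HAS_SPARK` flag and the `require_spark` helper are illustrative names, not part of the library's API:

```python
# Optional-dependency guard: probe for PySpark once at import time.
# HAS_SPARK is an illustrative flag name, not part of recommenders itself.
try:
    from pyspark.sql import SparkSession  # noqa: F401
    HAS_SPARK = True
except ImportError:
    SparkSession = None
    HAS_SPARK = False


def require_spark():
    """Raise a helpful error when a Spark-only code path is entered."""
    if not HAS_SPARK:
        raise ImportError(
            'PySpark is not installed; run pip install "recommenders[spark]"'
        )
```

This lets pure-Python callers import the module freely, while Spark-dependent entry points fail with an actionable message instead of a bare `NameError`.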

Usage

Use this environment for ALS Spark recommendation workflows, Spark-based evaluation (`SparkRankingEvaluation`, `SparkRatingEvaluation`), Spark data splitting, and the Spark benchmarking paths. It is required whenever the `recommenders[spark]` extra is installed.

System Requirements

  • OS: Linux (primary), Windows, macOS. Windows requires extra environment variables.
  • Java: JDK 21 (Temurin). Docker and DevContainer install Temurin JDK 21 via `apt`.
  • RAM: >= 10 GB. Default Spark driver memory is 10 GB.
  • CPUs: >= 8. CI tests run on Azure Standard_A8m_v2 (8 vCPUs, 64 GiB).
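A quick local sanity check against these requirements can be scripted with the standard library. Note that `shutil.which` only confirms a `java` launcher is on `PATH`, not its version; the function name below is an assumption for illustration:

```python
import os
import shutil


def check_spark_prereqs():
    """Report whether the basic Spark prerequisites look satisfied locally.

    Illustrative helper, not part of the recommenders library.
    """
    return {
        # Is a Java launcher visible on PATH?
        "java_on_path": shutil.which("java") is not None,
        # Is JAVA_HOME configured (required on Windows)?
        "java_home_set": bool(os.environ.get("JAVA_HOME")),
        # Rough check against the >= 8 vCPU CI baseline.
        "cpus_ok": (os.cpu_count() or 0) >= 8,
    }
```

Running this before launching a SparkSession surfaces missing Java or undersized hardware early, instead of inside an opaque JVM startup failure.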

Dependencies

Spark Python Packages

  • `pyspark` >= 3.3.0, < 4
  • `pyarrow` >= 10.0.1

System Packages

  • Java Development Kit (JDK 21 recommended, Temurin distribution)
  • Apache Spark runtime (bundled with PySpark or standalone)

Optional: MMLSpark / SynapseML

Credentials

The following environment variables must be set depending on the platform:

All platforms:

  • `PYSPARK_SUBMIT_ARGS`: Auto-set by `start_or_get_spark()` when packages/jars are specified
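As an illustration of what `start_or_get_spark()` does when packages or jars are supplied, the submit-args string can be composed roughly like this. The helper name and exact formatting are assumptions, not the library's code; only the trailing `pyspark-shell` marker is a genuine spark-submit convention:

```python
def build_submit_args(packages=None, jars=None, repositories=None):
    """Compose a PYSPARK_SUBMIT_ARGS value from optional package/jar lists.

    Illustrative only: the real composition lives in start_or_get_spark().
    """
    parts = []
    if packages:
        parts += ["--packages", ",".join(packages)]
    if jars:
        parts += ["--jars", ",".join(jars)]
    if repositories:
        parts += ["--repositories", ",".join(repositories)]
    # PYSPARK_SUBMIT_ARGS must end with this marker for the PySpark shell.
    parts.append("pyspark-shell")
    return " ".join(parts)


args = build_submit_args(
    packages=["com.microsoft.azure:synapseml_2.12:0.9.5"],
    repositories=["https://mmlspark.azureedge.net/maven"],
)
# One would then set: os.environ["PYSPARK_SUBMIT_ARGS"] = args
```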

Windows (required):

  • `JAVA_HOME`: Path to JDK installation
  • `SPARK_HOME`: Path to Spark installation
  • `HADOOP_HOME`: Path to Hadoop winutils
  • `PYSPARK_PYTHON`: Path to Python executable
  • `PYSPARK_DRIVER_PYTHON`: Path to Python executable

macOS (required):

  • `PYSPARK_PYTHON`: Path to Python executable
  • `PYSPARK_DRIVER_PYTHON`: Path to Python executable
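On Windows and macOS, the two interpreter variables can simply be pointed at the current Python, mirroring what the CI setup shown later does. A minimal sketch:

```python
import os
import sys

# Point both driver and worker Python at the interpreter running this script,
# so Spark workers do not resolve a different (possibly incompatible) Python.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```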

CI test environment:

  • `SPARK_HOME`: Must be unset in functional tests to avoid conflicts (removed via `os.environ.pop`)

Quick Install

# Install with Spark extras
pip install "recommenders[spark]"

# Or install Java separately (Ubuntu/Debian)
sudo apt-get install -y temurin-21-jdk

# DevContainer auto-installs via postCreateCommand:
# pip install -e .[dev,spark]

Code Evidence

PySpark optional import guard from `recommenders/utils/spark_utils.py:7-10`:

try:
    from pyspark.sql import SparkSession  # noqa: F401
except ImportError:
    pass  # skip this import if we are in pure python environment

Default Spark configuration from `recommenders/utils/spark_utils.py:19-22`:

def start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",

JVM stack size configuration from `recommenders/utils/spark_utils.py:68-70`:

# Set larger stack size
spark_opts.append('config("spark.executor.extraJavaOptions", "-Xss4m")')
spark_opts.append('config("spark.driver.extraJavaOptions", "-Xss4m")')

MMLSpark package definition from `recommenders/utils/spark_utils.py:12-16`:

MMLSPARK_PACKAGE = "com.microsoft.azure:synapseml_2.12:0.9.5"
MMLSPARK_REPO = "https://mmlspark.azureedge.net/maven"
# We support Spark v3, but in case you wish to use v2, set
# MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"

CI test Spark environment setup from `tests/functional/examples/test_notebooks_pyspark.py:71-73`:

os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
os.environ.pop("SPARK_HOME", None)

Common Errors

  • `ImportError: pyspark not found`: PySpark is not installed. Fix: `pip install "recommenders[spark]"`.
  • `JAVA_HOME is not set`: Java is not installed or `JAVA_HOME` is not configured. Fix: install JDK 21 and set `JAVA_HOME`.
  • Flaky Spark test failures: ALS PySpark tests are known to be flaky. Tests use `@pytest.mark.flaky(reruns=5, reruns_delay=2)`.
  • `StackOverflowError` in Spark: the default JVM stack size is too small. Handled automatically by `start_or_get_spark()`, which sets `-Xss4m`.

Compatibility Notes

  • Spark v3 only: The codebase targets Spark 3.x. Spark 2.x requires changing the MMLSpark package to `com.microsoft.ml.spark:mmlspark_2.11:0.18.1`.
  • Databricks: Tested on Databricks Runtime 12.2-15.4 LTS (Spark 3.3.2-3.5.0). Requires manual installation of `numpy<2.0.0`, `pandera<=0.18.3`, `scipy<=1.13.1`.
  • Windows: Requires five environment variables (JAVA_HOME, SPARK_HOME, HADOOP_HOME, PYSPARK_PYTHON, PYSPARK_DRIVER_PYTHON).
  • macOS: Requires PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to be set.
  • CI: Spark tests run on Azure Standard_A8m_v2 (8 vCPUs, 64 GiB memory). PySpark functional tests use `@pytest.mark.skipif(sys.platform == "win32")` to skip on Windows.
