Recommenders: Start Or Get Spark
| Knowledge Sources | |
|---|---|
| Domains | Distributed Computing, Infrastructure |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for initializing or retrieving an Apache Spark session configured for recommendation workloads.
Description
The start_or_get_spark function provides a single-call interface for creating a fully configured pyspark.sql.SparkSession. It handles three concerns that are typically spread across multiple configuration steps:
- Package injection: Assembles --packages, --jars, and --repositories flags into the PYSPARK_SUBMIT_ARGS environment variable before the JVM starts.
- Configuration merging: Applies an arbitrary dictionary of Spark configuration key-value pairs via the builder pattern, with sensible defaults for driver memory (10g) and JVM stack size (-Xss4m).
- Session reuse: Calls getOrCreate() so that repeated invocations in notebook cells return the existing session rather than failing or creating duplicates.
The function uses Python's eval() to dynamically construct the builder chain from the configuration parameters, allowing maximum flexibility in the set of options passed through.
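The flag-assembly step described above can be sketched in plain Python. This is a hypothetical illustration, not the library's actual code: the helper name build_submit_args is invented here, and the exact flag ordering in the real spark_utils.py may differ. The key invariant it shows is that PYSPARK_SUBMIT_ARGS must be set before the JVM starts, i.e. before any SparkSession is created.

```python
# Hypothetical sketch of the PYSPARK_SUBMIT_ARGS assembly step
# (illustrative only; not the library's actual implementation).
import os

def build_submit_args(packages=None, jars=None, repositories=None):
    """Assemble spark-submit style flags into a single string suitable
    for the PYSPARK_SUBMIT_ARGS environment variable."""
    parts = []
    if packages:
        parts.append("--packages " + ",".join(packages))
    if jars:
        parts.append("--jars " + ",".join(jars))
    if repositories:
        parts.append("--repositories " + ",".join(repositories))
    # spark-submit expects a trailing command; pyspark-shell is the
    # conventional value when launching Spark from Python.
    parts.append("pyspark-shell")
    return " ".join(parts)

# Must happen before the first SparkSession is created, because the
# JVM reads this environment variable only once at startup.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    packages=["com.microsoft.azure:synapseml_2.12:0.9.5"],
    repositories=["https://mmlspark.azureedge.net/maven"],
)
```

Setting the variable after a session already exists has no effect, which is one reason the function bundles this step with session creation.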
Usage
Call this function at the top of any Spark-based recommendation script or notebook. Pass the returned SparkSession object to downstream functions such as load_spark_df, ALS.fit(), and evaluation classes. In Databricks environments, a session is typically pre-created, but this function can still be used to ensure specific configuration settings are applied.
Code Reference
Source Location
- Repository: recommenders
- File: recommenders/utils/spark_utils.py (Lines 19-73)
Signature
```python
def start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repositories=None,
) -> pyspark.sql.SparkSession
```
Import
```python
from recommenders.utils.spark_utils import start_or_get_spark
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| app_name | str | No (default: "Sample") | Application name shown in the Spark UI and cluster manager logs |
| url | str | No (default: "local[*]") | Spark master URL; "local[*]" for local mode, "spark://host:port" for a standalone cluster, "yarn" for YARN |
| memory | str | No (default: "10g") | Driver memory allocation; ignored if spark.driver.memory is set in config |
| config | dict | No (default: None) | Dictionary of Spark configuration key-value pairs, e.g. {"spark.sql.shuffle.partitions": "200"} |
| packages | list | No (default: None) | List of Maven coordinates to install, e.g. ["com.microsoft.azure:synapseml_2.12:0.9.5"] |
| jars | list | No (default: None) | List of local JAR file paths to add to the classpath |
| repositories | list | No (default: None) | List of Maven repository URLs to search for packages |
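The precedence rule noted for memory (ignored when spark.driver.memory appears in config) can be illustrated with a small merge sketch. The helper name merge_spark_config and the exact default keys are assumptions for illustration; the real function applies these values through the SparkSession builder rather than a plain dict.

```python
# Hypothetical sketch of the default-merging rule: user-supplied
# config keys override the function's defaults (illustrative only).
def merge_spark_config(memory="10g", config=None):
    merged = {
        "spark.driver.memory": memory,
        # Assumed default for the JVM stack-size setting mentioned above.
        "spark.driver.extraJavaOptions": "-Xss4m",
    }
    if config:
        merged.update(config)  # explicit config wins over defaults
    return merged

# memory argument applies when config does not set spark.driver.memory
merge_spark_config(memory="16g")
# spark.driver.memory -> "16g"

# an explicit spark.driver.memory in config takes precedence
merge_spark_config(memory="16g", config={"spark.driver.memory": "32g"})
# spark.driver.memory -> "32g"
```

This dict-update semantics is why the Outputs session reflects config values verbatim: later builder calls overwrite earlier ones for the same key.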
Outputs
| Name | Type | Description |
|---|---|---|
| spark | pyspark.sql.SparkSession | Configured Spark session ready for distributed computation |
Usage Examples
Basic Local Session
```python
from recommenders.utils.spark_utils import start_or_get_spark

# Start a local Spark session with default settings
spark = start_or_get_spark(app_name="ALS_Recommender", memory="16g")
```
Session with External Packages
```python
from recommenders.utils.spark_utils import start_or_get_spark

# Start with SynapseML package for Azure ML integrations
spark = start_or_get_spark(
    app_name="ALS_with_SynapseML",
    url="local[*]",
    memory="16g",
    packages=["com.microsoft.azure:synapseml_2.12:0.9.5"],
    repositories=["https://mmlspark.azureedge.net/maven"],
)
```
Session with Custom Configuration
```python
from recommenders.utils.spark_utils import start_or_get_spark

spark = start_or_get_spark(
    app_name="ALS_Production",
    url="yarn",
    config={
        "spark.driver.memory": "32g",
        "spark.executor.memory": "16g",
        "spark.sql.shuffle.partitions": "400",
    },
)
```