Implementation: Recommenders team / Recommenders / Start Or Get Spark

From Leeroopedia


Knowledge Sources
Domains Distributed Computing, Infrastructure
Last Updated 2026-02-10 00:00 GMT

Overview

Concrete tool for initializing or retrieving an Apache Spark session configured for recommendation workloads.

Description

The start_or_get_spark function provides a single-call interface for creating a fully configured pyspark.sql.SparkSession. It handles three concerns that are typically spread across multiple configuration steps:

  1. Package injection: Assembles --packages, --jars, and --repositories flags into the PYSPARK_SUBMIT_ARGS environment variable before the JVM starts.
  2. Configuration merging: Applies an arbitrary dictionary of Spark configuration key-value pairs via the builder pattern, with sensible defaults for driver memory (10g) and JVM stack size (-Xss4m).
  3. Session reuse: Calls getOrCreate() so that repeated invocations in notebook cells return the existing session rather than failing or creating duplicates.

The function uses Python's eval() to dynamically construct the builder chain from the configuration parameters, allowing maximum flexibility in the set of options passed through.
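The package-injection step described above can be sketched as plain string assembly. The helper name `_assemble_submit_args` below is hypothetical, chosen for illustration; the real function performs equivalent work inline before the JVM starts:

```python
import os

def _assemble_submit_args(packages=None, jars=None, repositories=None):
    """Illustrative sketch: build PYSPARK_SUBMIT_ARGS from Maven/JAR options."""
    parts = []
    if packages:
        parts.append("--packages " + ",".join(packages))
    if jars:
        parts.append("--jars " + ",".join(jars))
    if repositories:
        parts.append("--repositories " + ",".join(repositories))
    if parts:
        # PySpark's launcher expects a trailing "pyspark-shell" token
        os.environ["PYSPARK_SUBMIT_ARGS"] = " ".join(parts) + " pyspark-shell"
    return os.environ.get("PYSPARK_SUBMIT_ARGS", "")

args = _assemble_submit_args(
    packages=["com.microsoft.azure:synapseml_2.12:0.9.5"],
    repositories=["https://mmlspark.azureedge.net/maven"],
)
```

Because environment variables are read when the JVM launches, this assembly must happen before the first SparkSession is created; setting PYSPARK_SUBMIT_ARGS afterwards has no effect on a running session.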

Usage

Call this function at the top of any Spark-based recommendation script or notebook. Pass the returned SparkSession object to downstream functions such as load_spark_df, ALS.fit(), and evaluation classes. In Databricks environments, a session is typically pre-created, but this function can still be used to ensure specific configuration settings are applied.
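The session-reuse behavior that makes repeated notebook invocations safe can be mimicked with a small cache. The placeholder below stands in for a real SparkSession purely to show the getOrCreate() semantics; it is not the library's implementation:

```python
_cache = {}

def get_or_create(app_name="Sample"):
    """Mimic SparkSession.builder.getOrCreate(): reuse any existing session."""
    if "session" not in _cache:
        # A plain object stands in for a real SparkSession in this sketch
        _cache["session"] = object()
    return _cache["session"]

first = get_or_create("ALS_Recommender")
second = get_or_create("Evaluation")  # returns the same object, not a new session
```

A consequence worth noting: arguments passed on the second call have no effect, because the first live session is simply returned.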

Code Reference

Source Location

  • Repository: recommenders
  • File: recommenders/utils/spark_utils.py (Lines 19-73)

Signature

def start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repositories=None,
) -> pyspark.sql.SparkSession

Import

from recommenders.utils.spark_utils import start_or_get_spark

I/O Contract

Inputs

All parameters are optional.

  • app_name (str, default "Sample"): Application name shown in the Spark UI and cluster manager logs
  • url (str, default "local[*]"): Spark master URL; "local[*]" for local mode, "spark://host:port" for a standalone cluster, "yarn" for YARN
  • memory (str, default "10g"): Driver memory allocation; ignored if spark.driver.memory is set in config
  • config (dict, default None): Dictionary of Spark configuration key-value pairs, e.g. {"spark.sql.shuffle.partitions": "200"}
  • packages (list, default None): List of Maven coordinates to install, e.g. ["com.microsoft.azure:synapseml_2.12:0.9.5"]
  • jars (list, default None): List of local JAR file paths to add to the classpath
  • repositories (list, default None): List of Maven repository URLs to search for packages

Outputs

  • spark (pyspark.sql.SparkSession): Configured Spark session ready for distributed computation

Usage Examples

Basic Local Session

from recommenders.utils.spark_utils import start_or_get_spark

# Start a local Spark session with default settings
spark = start_or_get_spark(app_name="ALS_Recommender", memory="16g")

Session with External Packages

from recommenders.utils.spark_utils import start_or_get_spark

# Start with SynapseML package for Azure ML integrations
spark = start_or_get_spark(
    app_name="ALS_with_SynapseML",
    url="local[*]",
    memory="16g",
    packages=["com.microsoft.azure:synapseml_2.12:0.9.5"],
    repositories=["https://mmlspark.azureedge.net/maven"],
)

Session with Custom Configuration

from recommenders.utils.spark_utils import start_or_get_spark

spark = start_or_get_spark(
    app_name="ALS_Production",
    url="yarn",
    config={
        "spark.driver.memory": "32g",
        "spark.executor.memory": "16g",
        "spark.sql.shuffle.partitions": "400",
    },
)
