Recommenders: Start Or Get Spark
| Knowledge Sources | |
|---|---|
| Domains | Distributed Computing, Infrastructure |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for initializing or retrieving an Apache Spark session configured for recommendation workloads.
Description
The start_or_get_spark function provides a single-call interface for creating a fully configured pyspark.sql.SparkSession. It handles three concerns that are typically spread across multiple configuration steps:
- Package injection: Assembles --packages, --jars, and --repositories flags into the PYSPARK_SUBMIT_ARGS environment variable before the JVM starts.
- Configuration merging: Applies an arbitrary dictionary of Spark configuration key-value pairs via the builder pattern, with sensible defaults for driver memory (10g) and JVM stack size (-Xss4m).
- Session reuse: Calls getOrCreate() so that repeated invocations in notebook cells return the existing session rather than failing or creating duplicates.
The function uses Python's eval() to dynamically construct the builder chain from the configuration parameters, allowing maximum flexibility in the set of options passed through.
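The flag-assembly step described above can be sketched in plain Python. This is a hypothetical illustration, not the library's actual code: the helper name build_submit_args is invented here, and the exact flag ordering in the real spark_utils.py may differ. The key invariant it shows is that PYSPARK_SUBMIT_ARGS must be set before the JVM starts, i.e. before any SparkSession is created.

```python
# Hypothetical sketch of the PYSPARK_SUBMIT_ARGS assembly step
# (illustrative only; not the library's actual implementation).
import os

def build_submit_args(packages=None, jars=None, repositories=None):
    """Assemble spark-submit style flags into a single string suitable
    for the PYSPARK_SUBMIT_ARGS environment variable."""
    parts = []
    if packages:
        parts.append("--packages " + ",".join(packages))
    if jars:
        parts.append("--jars " + ",".join(jars))
    if repositories:
        parts.append("--repositories " + ",".join(repositories))
    # spark-submit expects a trailing command; pyspark-shell is the
    # conventional value when launching Spark from Python.
    parts.append("pyspark-shell")
    return " ".join(parts)

# Must happen before the first SparkSession is created, because the
# JVM reads this environment variable only once at startup.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    packages=["com.microsoft.azure:synapseml_2.12:0.9.5"],
    repositories=["https://mmlspark.azureedge.net/maven"],
)
```

Setting the variable after a session already exists has no effect, which is one reason the function bundles this step with session creation.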
Usage
Call this function at the top of any Spark-based recommendation script or notebook. Pass the returned SparkSession object to downstream functions such as load_spark_df, ALS.fit(), and evaluation classes. In Databricks environments, a session is typically pre-created, but this function can still be used to ensure specific configuration settings are applied.
Code Reference
Source Location
- Repository: recommenders
- File: recommenders/utils/spark_utils.py (Lines 19-73)
Signature
```python
def start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repositories=None,
) -> pyspark.sql.SparkSession
```
Import
```python
from recommenders.utils.spark_utils import start_or_get_spark
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| app_name | str | No (default: "Sample") | Application name shown in the Spark UI and cluster manager logs |
| url | str | No (default: "local[*]") | Spark master URL; "local[*]" for local mode, "spark://host:port" for a standalone cluster, "yarn" for YARN |
| memory | str | No (default: "10g") | Driver memory allocation; ignored if spark.driver.memory is set in config |
| config | dict | No (default: None) | Dictionary of Spark configuration key-value pairs, e.g. {"spark.sql.shuffle.partitions": "200"} |
| packages | list | No (default: None) | List of Maven coordinates to install, e.g. ["com.microsoft.azure:synapseml_2.12:0.9.5"] |
| jars | list | No (default: None) | List of local JAR file paths to add to the classpath |
| repositories | list | No (default: None) | List of Maven repository URLs to search for packages |
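The precedence rule noted for memory (ignored when spark.driver.memory appears in config) can be illustrated with a small merge sketch. The helper name merge_spark_config and the exact default keys are assumptions for illustration; the real function applies these values through the SparkSession builder rather than a plain dict.

```python
# Hypothetical sketch of the default-merging rule: user-supplied
# config keys override the function's defaults (illustrative only).
def merge_spark_config(memory="10g", config=None):
    merged = {
        "spark.driver.memory": memory,
        # Assumed default for the JVM stack-size setting mentioned above.
        "spark.driver.extraJavaOptions": "-Xss4m",
    }
    if config:
        merged.update(config)  # explicit config wins over defaults
    return merged

# memory argument applies when config does not set spark.driver.memory
merge_spark_config(memory="16g")
# spark.driver.memory -> "16g"

# an explicit spark.driver.memory in config takes precedence
merge_spark_config(memory="16g", config={"spark.driver.memory": "32g"})
# spark.driver.memory -> "32g"
```

This dict-update semantics is why the Outputs session reflects config values verbatim: later builder calls overwrite earlier ones for the same key.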
Outputs
| Name | Type | Description |
|---|---|---|
| spark | pyspark.sql.SparkSession | Configured Spark session ready for distributed computation |
Usage Examples
Basic Local Session
```python
from recommenders.utils.spark_utils import start_or_get_spark

# Start a local Spark session with default settings
spark = start_or_get_spark(app_name="ALS_Recommender", memory="16g")
```
Session with External Packages
```python
from recommenders.utils.spark_utils import start_or_get_spark

# Start with SynapseML package for Azure ML integrations
spark = start_or_get_spark(
    app_name="ALS_with_SynapseML",
    url="local[*]",
    memory="16g",
    packages=["com.microsoft.azure:synapseml_2.12:0.9.5"],
    repositories=["https://mmlspark.azureedge.net/maven"],
)
```
Session with Custom Configuration
```python
from recommenders.utils.spark_utils import start_or_get_spark

spark = start_or_get_spark(
    app_name="ALS_Production",
    url="yarn",
    config={
        "spark.driver.memory": "32g",
        "spark.executor.memory": "16g",
        "spark.sql.shuffle.partitions": "400",
    },
)
```