Principle: Recommenders Spark Session Management
| Knowledge Sources | |
|---|---|
| Domains | Distributed Computing, Recommendation Systems, Infrastructure |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Managing distributed computing sessions involves initializing and reusing Spark sessions with proper memory, package, and configuration settings for scalable data processing.
Description
Apache Spark operates on a driver-executor architecture where a single SparkSession serves as the unified entry point to all Spark functionality. Before any distributed computation can take place, a session must be created (or an existing one retrieved) with the correct configuration:
- Application Identity: Each Spark application is identified by a name (e.g., `"ALS_Recommender"`) that appears in the Spark UI and cluster manager logs. This name helps distinguish concurrent workloads on shared clusters.
- Master URL: The `url` parameter determines whether the application runs locally (`"local[*]"` uses all available CPU cores) or connects to a cluster manager (e.g., `"spark://host:7077"` for standalone, or `"yarn"` for Hadoop YARN).
- Memory Allocation: The driver process requires sufficient heap memory (controlled by `spark.driver.memory`) to hold broadcast variables, collected results, and metadata. Recommendation workloads typically require 10 GB or more due to large user-item matrices.
- Package Management: External Maven packages (e.g., SynapseML for Azure integrations) and JAR files can be injected at session creation time via `PYSPARK_SUBMIT_ARGS`.
- Session Reuse: Spark follows a singleton pattern: `getOrCreate()` returns the existing session if one is already active, avoiding the overhead of re-initialization.
Proper session management is essential because misconfigured sessions lead to out-of-memory errors, package resolution failures, or silent performance degradation across the entire recommendation pipeline.
Usage
Use this principle at the very start of any Spark-based recommendation workflow. It is the first step before loading data, training models, or computing evaluation metrics. The session should be created once and passed to all downstream functions that require a Spark context.
Theoretical Basis
The SparkSession abstraction unifies the older SparkContext, SQLContext, and HiveContext into a single entry point. The initialization follows a builder pattern:
1. Set application name and master URL
2. Apply configuration key-value pairs:
   - `spark.driver.memory = "10g"`
   - `spark.executor.extraJavaOptions = "-Xss4m"`
   - `spark.driver.extraJavaOptions = "-Xss4m"`
3. Inject external packages via the `PYSPARK_SUBMIT_ARGS` environment variable:
   - `--packages com.microsoft.azure:synapseml_2.12:0.9.5`
   - `--jars /path/to/custom.jar`
   - `--repositories https://mmlspark.azureedge.net/maven`
4. Call getOrCreate() to either:
- Create a new session with the specified configuration, OR
- Return the existing session if one is already active
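Step 3 amounts to plain string assembly before the session is created. The helper below is illustrative; one detail that is easy to miss is that PySpark expects the value to end with `pyspark-shell`:

```python
import os


def build_submit_args(packages=(), jars=(), repositories=()):
    """Illustrative helper: assemble a PYSPARK_SUBMIT_ARGS value.

    Must be set in the environment *before* the SparkSession is created,
    since the arguments are consumed when the JVM gateway launches.
    """
    parts = []
    if packages:
        parts += ["--packages", ",".join(packages)]
    if jars:
        parts += ["--jars", ",".join(jars)]
    if repositories:
        parts += ["--repositories", ",".join(repositories)]
    parts.append("pyspark-shell")  # required terminator for PySpark
    return " ".join(parts)


os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    packages=["com.microsoft.azure:synapseml_2.12:0.9.5"],
    repositories=["https://mmlspark.azureedge.net/maven"],
)
```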
The `getOrCreate()` pattern is critical for notebook environments (Jupyter, Databricks) where cells may be re-executed. Without it, repeated initialization would either fail or create conflicting sessions. The `local[*]` master URL tells Spark to run in local mode using all available processor cores, which is sufficient for prototyping and moderate-scale datasets.
Stack size configuration (`-Xss4m`) is set for both driver and executor JVMs to prevent `StackOverflowError` in the deeply recursive operations that can occur during Spark's query plan optimization.