

Principle: Recommenders team / Recommenders Spark Session Management

From Leeroopedia


Knowledge Sources
Domains Distributed Computing, Recommendation Systems, Infrastructure
Last Updated 2026-02-10 00:00 GMT

Overview

Managing distributed computing sessions involves initializing and reusing Spark sessions with proper memory, package, and configuration settings for scalable data processing.

Description

Apache Spark operates on a driver-executor architecture where a single SparkSession serves as the unified entry point to all Spark functionality. Before any distributed computation can take place, a session must be created (or an existing one retrieved) with the correct configuration:

  1. Application Identity: Each Spark application is identified by a name (e.g., "ALS_Recommender") that appears in the Spark UI and cluster manager logs. This name helps distinguish concurrent workloads on shared clusters.
  2. Master URL: The url parameter determines whether the application runs locally ("local[*]" uses all available CPU cores) or connects to a cluster manager (e.g., "spark://host:7077" for standalone, or "yarn" for Hadoop YARN).
  3. Memory Allocation: The driver process requires sufficient heap memory (controlled by spark.driver.memory) to hold broadcast variables, collected results, and metadata. Recommendation workloads typically require 10 GB or more due to large user-item matrices.
  4. Package Management: External Maven packages (e.g., SynapseML for Azure integrations) and JAR files can be injected at session creation time via PYSPARK_SUBMIT_ARGS.
  5. Session Reuse: Spark follows a singleton pattern where getOrCreate() returns the currently active session if one exists (applying any newly specified configuration options where possible), avoiding the overhead of re-initialization.
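The five points above can be sketched as a small helper. This is a minimal sketch, not an established API: the helper name start_spark is hypothetical, and the application name, master URL, and memory values are the illustrative defaults from this page.

```python
# Configuration collected as a plain dict so it can be inspected and reused.
# Values mirror the examples in this page; adjust them for your workload.
SPARK_CONF = {
    "spark.driver.memory": "10g",                 # ample driver heap for large user-item matrices
    "spark.executor.extraJavaOptions": "-Xss4m",  # larger thread stacks for deep query plans
    "spark.driver.extraJavaOptions": "-Xss4m",
}

def start_spark(app_name="ALS_Recommender", url="local[*]", config=SPARK_CONF):
    """Create a new SparkSession, or return the active one if it exists."""
    from pyspark.sql import SparkSession  # deferred import: requires pyspark installed
    builder = SparkSession.builder.appName(app_name).master(url)
    for key, value in config.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Importing pyspark lazily inside the function keeps the configuration dict inspectable even on machines without a Spark installation.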

Proper session management is essential because misconfigured sessions lead to out-of-memory errors, package resolution failures, or silent performance degradation across the entire recommendation pipeline.

Usage

Use this principle at the very start of any Spark-based recommendation workflow. It is the first step before loading data, training models, or computing evaluation metrics. The session should be created once and passed to all downstream functions that require a Spark context.
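A minimal sketch of that flow, with hypothetical function names (load_ratings, run_pipeline) used purely for illustration of passing one session through the pipeline:

```python
def load_ratings(spark, path):
    """Downstream stage: receives the shared session rather than creating its own."""
    return spark.read.csv(path, header=True, inferSchema=True)

def run_pipeline(spark):
    """Entry point: every stage reuses the single session passed in."""
    ratings = load_ratings(spark, "ratings.csv")  # illustrative path
    # ...training, evaluation, and other stages would follow here,
    # each taking `spark` as a parameter instead of calling getOrCreate().
```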

Theoretical Basis

The SparkSession abstraction unifies the older SparkContext, SQLContext, and HiveContext into a single entry point. The initialization follows a builder pattern:

1. Set application name and master URL
2. Apply configuration key-value pairs:
   - spark.driver.memory = "10g"
   - spark.executor.extraJavaOptions = "-Xss4m"
   - spark.driver.extraJavaOptions = "-Xss4m"
3. Inject external packages via PYSPARK_SUBMIT_ARGS environment variable:
   - --packages com.microsoft.azure:synapseml_2.12:0.9.5
   - --jars /path/to/custom.jar
   - --repositories https://mmlspark.azureedge.net/maven
4. Call getOrCreate() to either:
   - Create a new session with the specified configuration, OR
   - Return the existing session if one is already active
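Step 3 is the only part that happens outside the builder: the environment variable must be set before the JVM launches, i.e. before the first SparkSession is created. A sketch using the example coordinates above:

```python
import os

# Assemble spark-submit arguments for package injection. The coordinates and
# repository are the example values from this page; the trailing
# "pyspark-shell" token is required when launching through pyspark itself.
packages = ["com.microsoft.azure:synapseml_2.12:0.9.5"]
repositories = ["https://mmlspark.azureedge.net/maven"]

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    f"--packages {','.join(packages)} "
    f"--repositories {','.join(repositories)} "
    "pyspark-shell"
)
```

A --jars argument pointing at local JAR files could be appended to the same string in the same way.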

The getOrCreate() pattern is critical for notebook environments (Jupyter, Databricks) where cells may be re-executed. Without it, repeated initialization would either fail or create conflicting sessions. The local[*] master URL tells Spark to run in local mode using all available processor cores, which is sufficient for prototyping and moderate-scale datasets.

Stack size configuration (-Xss4m) is set for both driver and executor JVMs to prevent StackOverflowError in deeply recursive operations that can occur during Spark's query plan optimization.

Related Pages

Implemented By
