Principle: Spark Session Configuration (DataExpert.io Data Engineer Handbook)
Overview
SparkSession serves as the unified entry point to all Spark functionality. Introduced in Spark 2.0, it consolidated the older SQLContext and HiveContext entry points into a single object that wraps the underlying SparkContext and provides access to DataFrame and SQL operations, configuration management, and catalog interactions.
Theory
The SparkSession follows the builder pattern for construction, allowing callers to chain configuration methods before creating the session. Key configuration options include:
- master - specifies the cluster manager URL (e.g., local, yarn, spark://host:port)
- appName - sets a human-readable name for the Spark application, visible in the Spark UI
- config - sets arbitrary Spark configuration key-value pairs (e.g., catalog implementations, warehouse directories)
The builder culminates in a call to getOrCreate(), which either creates a new session or returns an existing one.
When to Apply
Any PySpark application must create a SparkSession before performing data operations. This is always the first step in:
- Batch ETL jobs
- Interactive notebook sessions
- Streaming applications
- Ad-hoc SQL queries against catalog-managed tables
Theoretical Basis
The SparkSession relies on two design patterns:
- Lazy Initialization - the session and its underlying SparkContext are not created until getOrCreate() is invoked, allowing all configuration to be set beforehand
- Singleton Pattern - within a single JVM, getOrCreate() returns the same SparkSession instance if one already exists, preventing resource duplication
```python
# Conceptual pattern
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("local")
    .appName("my_app")
    .getOrCreate())
```