Principle: Spark Session Configuration (DataExpert.io Data Engineer Handbook)
Overview
SparkSession serves as the unified entry point to all Spark functionality. Introduced in Spark 2.0, it consolidated the older SQLContext and HiveContext entry points into a single object that wraps the underlying SparkContext and provides access to DataFrame and SQL operations, configuration management, and catalog interactions.
Theory
The SparkSession follows the builder pattern for construction, allowing callers to chain configuration methods before creating the session. Key configuration options include:
- master - specifies the cluster manager URL (e.g., local, yarn, spark://host:port)
- appName - sets a human-readable name for the Spark application, visible in the Spark UI
- config - sets arbitrary Spark configuration key-value pairs (e.g., catalog implementations, warehouse directories)
The builder culminates in a call to getOrCreate(), which either creates a new session or returns an existing one.
When to Apply
Any PySpark application must create a SparkSession before performing data operations. This is always the first step in:
- Batch ETL jobs
- Interactive notebook sessions
- Streaming applications
- Ad-hoc SQL queries against catalog-managed tables
Theoretical Basis
The SparkSession relies on two design patterns:
- Lazy Initialization - the session and its underlying SparkContext are not created until getOrCreate() is invoked, allowing all configuration to be set beforehand
- Singleton Pattern - within a single JVM, getOrCreate() returns the same SparkSession instance if one already exists, preventing resource duplication
```python
# Conceptual pattern
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("local")
    .appName("my_app")
    .getOrCreate())
```