Principle:DataExpert io Data engineer handbook Spark Session Configuration

From Leeroopedia


Overview

SparkSession is the unified entry point to Spark's DataFrame and SQL functionality. Introduced in Spark 2.0, it subsumes the older SQLContext and HiveContext entry points and wraps the underlying SparkContext, which remains reachable through the session. A single object therefore provides access to DataFrame and SQL operations, configuration management, and catalog interactions.

Theory

The SparkSession follows the builder pattern for construction, allowing callers to chain configuration methods before creating the session. Key configuration options include:

  • master - specifies the cluster manager URL (e.g., local, yarn, spark://host:port)
  • appName - sets a human-readable name for the Spark application, visible in the Spark UI
  • config - sets arbitrary Spark configuration key-value pairs (e.g., catalog implementations, warehouse directories)

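The builder chain ends with a call to getOrCreate(), which returns an existing session if one is available and otherwise creates a new one. When an existing session is returned, the options set on the builder are applied to it, although settings that are fixed once the SparkContext has started (such as master) do not change.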
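The builder chain described above can be sketched as follows. This is a minimal illustration rather than the handbook's own code: the helper names merged_conf and build_session are hypothetical, and the option values (shuffle partition count, warehouse path) are assumptions chosen only for the example. The option keys themselves are standard Spark settings.

```python
from typing import Dict, Optional

# Real Spark option keys; the values are illustrative assumptions.
DEFAULT_CONF: Dict[str, str] = {
    "spark.sql.shuffle.partitions": "8",
    "spark.sql.warehouse.dir": "/tmp/spark-warehouse",
}


def merged_conf(overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return the defaults with caller overrides applied on top."""
    conf = dict(DEFAULT_CONF)
    conf.update(overrides or {})
    return conf


def build_session(app_name: str, master: str = "local[*]",
                  overrides: Optional[Dict[str, str]] = None):
    """Chain master, appName, and config calls, then call getOrCreate()."""
    # Imported here so the helpers above stay usable without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(master).appName(app_name)
    for key, value in merged_conf(overrides).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A call such as build_session("my_app", overrides={"spark.sql.shuffle.partitions": "4"}) would then return a local session with the overridden setting, assuming PySpark and a Java runtime are available. Keeping the options in a plain dictionary makes them easy to inspect or vary per environment before any session exists.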

When to Apply

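Any PySpark application must obtain a SparkSession before performing DataFrame or SQL operations (the pyspark shell and many notebook environments create one automatically and expose it as spark). Obtaining the session is the first step in: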

  • Batch ETL jobs
  • Interactive notebook sessions
  • Streaming applications
  • Ad-hoc SQL queries against catalog-managed tables

Theoretical Basis

The SparkSession relies on two design patterns:

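  • Lazy Initialization - the session and its underlying SparkContext are not created until getOrCreate() is invoked, so all configuration can be set on the builder beforehand
  • Singleton Pattern - within a single JVM, getOrCreate() returns the existing active or default SparkSession if there is one instead of creating another, preventing resource duplication. Only one SparkContext is active per JVM; additional sessions created with newSession() share it.

# Conceptual pattern
from pyspark.sql import SparkSession

# Chain the configuration calls, then create or reuse the session
spark = (SparkSession.builder
    .master("local")      # run locally with a single worker thread
    .appName("my_app")    # name shown in the Spark UI
    .getOrCreate())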
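
To make the two patterns concrete without needing a Spark installation, the following toy model mimics the builder in plain Python. It is a conceptual sketch only: ToySession, its Builder, and get_or_create are hypothetical names, and the class is not how Spark implements SparkSession. In particular, the toy simply ignores options passed on later calls, whereas Spark applies builder options to the session it returns.

```python
from typing import Dict, Optional


class ToySession:
    """Toy stand-in for a session; not Spark's implementation."""

    _instance: Optional["ToySession"] = None  # process-wide singleton slot

    def __init__(self, options: Dict[str, str]) -> None:
        self.options = dict(options)

    class Builder:
        def __init__(self) -> None:
            self._options: Dict[str, str] = {}

        def config(self, key: str, value: str) -> "ToySession.Builder":
            # Only records the option; nothing is created yet (lazy initialization).
            self._options[key] = value
            return self

        def get_or_create(self) -> "ToySession":
            # Create the session on first use, then always return the same one.
            if ToySession._instance is None:
                ToySession._instance = ToySession(self._options)
            return ToySession._instance


first = ToySession.Builder().config("app.name", "my_app").get_or_create()
second = ToySession.Builder().config("app.name", "other").get_or_create()
print(first is second)             # True
print(first.options["app.name"])   # my_app
```

Both builders return the same object, and no ToySession exists until the first get_or_create() call, which is the behavior the Lazy Initialization and Singleton bullets describe.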
