Heuristic:DataExpert_io_Data_engineer_handbook_SparkSession_Singleton_Pattern
| Knowledge Sources | |
|---|---|
| Domains | Big_Data, Optimization |
| Last Updated | 2026-02-09 06:00 GMT |
Overview
Use a session-scoped SparkSession fixture in pytest so that JVM startup overhead is paid once per test session rather than once per test function.
Description
SparkSession creation involves starting a JVM process, which is expensive (several seconds). The repository uses a session-scoped pytest fixture to create a single SparkSession that is reused across all tests in a test session. This avoids the overhead of starting and stopping the JVM for each individual test function. The fixture uses `.getOrCreate()` which returns an existing session if one is already running, reinforcing the singleton pattern.
Usage
Apply this heuristic when writing PySpark unit tests with pytest. Always use a session-scoped fixture for SparkSession rather than creating a new session per test function or per test class. This is especially important when running large test suites where JVM startup time would otherwise dominate total test execution time.
The Insight (Rule of Thumb)
- Action: Define SparkSession as a `@pytest.fixture(scope='session')` in `conftest.py`.
- Value: Use `scope='session'` (not `scope='function'` or `scope='class'`).
- Trade-off: Session scope means all tests share the same SparkSession state. If a test modifies session-level configuration (e.g., `spark.sql.shuffle.partitions`), it affects subsequent tests. For isolation, use `scope='class'` or `scope='module'` at the cost of additional JVM restarts.
- Companion Pattern: Use `.master("local")` for test sessions and a descriptive `.appName()` to identify the test session in the Spark UI.
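One way to contain the shared-state trade-off without giving up session scope is to snapshot and restore any configuration a test touches. A minimal sketch, assuming a hypothetical `restored_conf` helper (`spark.conf.get` / `spark.conf.set` are the standard PySpark runtime-config API; the helper itself is not from the repository):

```python
from contextlib import contextmanager

@contextmanager
def restored_conf(spark, key):
    # Hypothetical helper: snapshot one runtime conf value and restore it on
    # exit, so a test that tweaks e.g. spark.sql.shuffle.partitions cannot
    # leak state into later tests sharing the session-scoped SparkSession.
    old = spark.conf.get(key)
    try:
        yield
    finally:
        spark.conf.set(key, old)
```

A test would wrap its config change, e.g. `with restored_conf(spark, "spark.sql.shuffle.partitions"): spark.conf.set("spark.sql.shuffle.partitions", "1")`, and later tests still see the original value.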
Reasoning
JVM startup in PySpark takes 3-10 seconds depending on hardware. With 20+ test functions, function-scoped SparkSession fixtures would add 60-200 seconds of pure overhead. Session scope reduces this to a single startup.
The `.getOrCreate()` method ensures that if a SparkSession already exists in the current process, it returns that instance rather than creating a new one. This is the standard singleton pattern for Spark applications and prevents the common error of multiple competing SparkContext instances (Spark only allows one active SparkContext per JVM).
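The get-or-create singleton behavior can be sketched in plain Python (illustrative only; this is not Spark's actual implementation, which tracks the active session on the JVM side):

```python
class Session:
    """Toy stand-in for SparkSession, illustrating getOrCreate semantics."""

    _active = None  # class-level singleton slot, like Spark's active session

    @classmethod
    def get_or_create(cls):
        # Return the existing instance if one is running; otherwise create it.
        if cls._active is None:
            cls._active = cls()
        return cls._active

a = Session.get_or_create()
b = Session.get_or_create()
assert a is b  # both calls return the same instance
```

This is why the session-scoped fixture and `.getOrCreate()` reinforce each other: even if a second builder runs in the same process, it resolves to the one active session instead of attempting to start a competing SparkContext.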
The repository names the test session `"chispa"` after the DataFrame assertion library used, which helps identify the test session in Spark UI logs.
Code Evidence
Session-scoped fixture from `conftest.py:1-9`:
```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope='session')
def spark():
    return SparkSession.builder \
        .master("local") \
        .appName("chispa") \
        .getOrCreate()
```
Fixture usage in test files (e.g., `test_monthly_user_site_hits.py`):
```python
def test_monthly_user_site_hits(spark):
    # spark fixture injected automatically by pytest
    input_data = [...]
    input_df = spark.createDataFrame(input_data)
```
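Per the Namedtuple_CreateDataFrame_Pattern page linked below, `input_data` rows are typically built as namedtuples so `createDataFrame` can infer the schema from the field names. A sketch (the row type and field names here are hypothetical, chosen for illustration):

```python
from collections import namedtuple

# Hypothetical row type; the real field names depend on the test's schema
SiteHit = namedtuple("SiteHit", ["user_id", "site", "hits"])

input_data = [
    SiteHit(user_id=1, site="example.com", hits=10),
    SiteHit(user_id=2, site="example.com", hits=3),
]
# spark.createDataFrame(input_data) would infer columns user_id, site, hits
```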
Related Pages
- Implementation:DataExpert_io_Data_engineer_handbook_Pytest_Spark_Fixture
- Implementation:DataExpert_io_Data_engineer_handbook_Namedtuple_CreateDataFrame_Pattern
- Implementation:DataExpert_io_Data_engineer_handbook_Chispa_Assert_df_equality
- Principle:DataExpert_io_Data_engineer_handbook_SparkSession_Test_Fixture