Heuristic:DataExpert_io_Data_engineer_handbook_SparkSession_Singleton_Pattern
| Knowledge Sources | |
|---|---|
| Domains | Big_Data, Optimization |
| Last Updated | 2026-02-09 06:00 GMT |
Overview
Use a session-scoped SparkSession fixture in pytest so that JVM startup overhead is paid once per test session rather than once per test function.
Description
SparkSession creation involves starting a JVM process, which is expensive (several seconds). The repository uses a session-scoped pytest fixture to create a single SparkSession that is reused across all tests in a test session. This avoids the overhead of starting and stopping the JVM for each individual test function. The fixture uses `.getOrCreate()` which returns an existing session if one is already running, reinforcing the singleton pattern.
Usage
Apply this heuristic when writing PySpark unit tests with pytest. Always use a session-scoped fixture for SparkSession rather than creating a new session per test function or per test class. This is especially important when running large test suites where JVM startup time would otherwise dominate total test execution time.
The Insight (Rule of Thumb)
- Action: Define SparkSession as a `@pytest.fixture(scope='session')` in `conftest.py`.
- Value: Use `scope='session'` (not `scope='function'` or `scope='class'`).
- Trade-off: Session scope means all tests share the same SparkSession state. If a test modifies session-level configuration (e.g., `spark.sql.shuffle.partitions`), it affects subsequent tests. For isolation, use `scope='class'` or `scope='module'` at the cost of additional JVM restarts.
- Companion Pattern: Use `.master("local")` for test sessions and a descriptive `.appName()` to identify the test session in the Spark UI.
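One way to contain the shared-state trade-off without giving up session scope is to snapshot and restore any configuration a test touches. A minimal sketch, assuming a hypothetical `restored_conf` helper (`spark.conf.get` / `spark.conf.set` are the standard PySpark runtime-config API; the helper itself is not from the repository):

```python
from contextlib import contextmanager

@contextmanager
def restored_conf(spark, key):
    # Hypothetical helper: snapshot one runtime conf value and restore it on
    # exit, so a test that tweaks e.g. spark.sql.shuffle.partitions cannot
    # leak state into later tests sharing the session-scoped SparkSession.
    old = spark.conf.get(key)
    try:
        yield
    finally:
        spark.conf.set(key, old)
```

A test would wrap its config change, e.g. `with restored_conf(spark, "spark.sql.shuffle.partitions"): spark.conf.set("spark.sql.shuffle.partitions", "1")`, and later tests still see the original value.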
Reasoning
JVM startup in PySpark takes 3-10 seconds depending on hardware. With 20+ test functions, function-scoped SparkSession fixtures would add 60-200 seconds of pure overhead. Session scope reduces this to a single startup.
The `.getOrCreate()` method ensures that if a SparkSession already exists in the current process, it returns that instance rather than creating a new one. This is the standard singleton pattern for Spark applications and prevents the common error of multiple competing SparkContext instances (Spark only allows one active SparkContext per JVM).
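The get-or-create singleton behavior can be sketched in plain Python (illustrative only; this is not Spark's actual implementation, which tracks the active session on the JVM side):

```python
class Session:
    """Toy stand-in for SparkSession, illustrating getOrCreate semantics."""

    _active = None  # class-level singleton slot, like Spark's active session

    @classmethod
    def get_or_create(cls):
        # Return the existing instance if one is running; otherwise create it.
        if cls._active is None:
            cls._active = cls()
        return cls._active

a = Session.get_or_create()
b = Session.get_or_create()
assert a is b  # both calls return the same instance
```

This is why the session-scoped fixture and `.getOrCreate()` reinforce each other: even if a second builder runs in the same process, it resolves to the one active session instead of attempting to start a competing SparkContext.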
The repository names the test session `"chispa"` after the DataFrame assertion library used, which helps identify the test session in Spark UI logs.
Code Evidence
Session-scoped fixture from `conftest.py:1-9`:
```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope='session')
def spark():
    return SparkSession.builder \
        .master("local") \
        .appName("chispa") \
        .getOrCreate()
```
Fixture usage in test files (e.g., `test_monthly_user_site_hits.py`):
```python
def test_monthly_user_site_hits(spark):
    # spark fixture injected automatically by pytest
    input_data = [...]
    input_df = spark.createDataFrame(input_data)
```
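Per the Namedtuple_CreateDataFrame_Pattern page linked below, `input_data` rows are typically built as namedtuples so `createDataFrame` can infer the schema from the field names. A sketch (the row type and field names here are hypothetical, chosen for illustration):

```python
from collections import namedtuple

# Hypothetical row type; the real field names depend on the test's schema
SiteHit = namedtuple("SiteHit", ["user_id", "site", "hits"])

input_data = [
    SiteHit(user_id=1, site="example.com", hits=10),
    SiteHit(user_id=2, site="example.com", hits=3),
]
# spark.createDataFrame(input_data) would infer columns user_id, site, hits
```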
Related Pages
- Implementation:DataExpert_io_Data_engineer_handbook_Pytest_Spark_Fixture
- Implementation:DataExpert_io_Data_engineer_handbook_Namedtuple_CreateDataFrame_Pattern
- Implementation:DataExpert_io_Data_engineer_handbook_Chispa_Assert_df_equality
- Principle:DataExpert_io_Data_engineer_handbook_SparkSession_Test_Fixture