
Principle:DataExpert io Data engineer handbook SparkSession Test Fixture

From Leeroopedia


Overview

The SparkSession Test Fixture principle applies the theory of shared test fixtures to PySpark testing. Because creating a SparkSession involves expensive JVM startup overhead, tests that repeatedly instantiate new sessions incur a significant performance penalty. This principle prescribes session-scoped fixtures so that a single SparkSession is created once and reused across the entire test suite.

Theory of Shared Test Fixtures

In PySpark testing, the SparkSession is the entry point for all DataFrame and SQL operations. Constructing a SparkSession triggers:

  • JVM process initialization
  • Spark context configuration
  • Internal catalog and scheduler setup

These operations are computationally expensive (often several seconds). Without fixture sharing, each test function would pay this cost independently, resulting in test suites that are orders of magnitude slower than necessary.

The solution is to treat the SparkSession as a shared test fixture — a resource that is initialized once, shared across all tests that depend on it, and torn down only after all tests have completed.
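As a concrete illustration, a shared SparkSession fixture might look like the following sketch placed in a `conftest.py`. The fixture name `spark` and the `local[2]` master setting are illustrative assumptions, not requirements of the principle:

```python
# conftest.py -- a minimal sketch of a shared SparkSession fixture.
import pytest


@pytest.fixture(scope="session")
def spark():
    """Create one SparkSession for the entire test run, then stop it."""
    # Import inside the fixture so merely collecting tests does not
    # require pyspark to be importable.
    from pyspark.sql import SparkSession

    session = (
        SparkSession.builder
        .master("local[2]")        # small local cluster, assumed for tests
        .appName("test-suite")
        .getOrCreate()
    )
    yield session                  # every dependent test reuses this session
    session.stop()                 # teardown runs once, after the last test
```

Everything before the `yield` is setup, paid once; everything after it is teardown, deferred until the final dependent test has finished.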

Session-Scoped Fixtures and pytest Dependency Injection

The pytest framework provides a fixture mechanism with configurable scoping. A fixture's scope determines its lifetime:

  • function scope (default) — the fixture is created and destroyed for each test function. This is the most isolated but most expensive option for SparkSession.
  • module scope — the fixture is created once per test module (file) and shared among all tests in that module.
  • session scope — the fixture is created once per entire test session and shared among all tests across all modules. This is the recommended scope for SparkSession.

Using pytest's dependency injection pattern, any test function that declares a parameter matching the fixture name will automatically receive the fixture value. No explicit import or instantiation is required in individual test files.

Fixture Scoping: Session vs Function vs Module

Scope      Lifetime          SparkSession Suitability
function   Per-test          Poor: JVM startup per test
module     Per-file          Acceptable for small suites
session    Entire test run   Recommended: single JVM startup

Singleton SparkSession

The underlying Spark framework itself enforces a singleton pattern for SparkSession within a single JVM. Calling SparkSession.builder.getOrCreate() returns the existing session if one is already active. The session-scoped fixture aligns with this singleton behavior, ensuring the test framework and Spark runtime agree on the session lifecycle.
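That contract can be sketched as follows. Because actually running it requires a local Java and Spark installation, the session creation is wrapped in a function rather than executed at import time:

```python
# Sketch of Spark's singleton behavior within one JVM: the second
# getOrCreate() call returns the already-active session instead of
# starting a new one.
def demo_singleton_session():
    from pyspark.sql import SparkSession

    first = SparkSession.builder.master("local[1]").appName("a").getOrCreate()
    second = SparkSession.builder.appName("b").getOrCreate()  # reuses "first"
    try:
        return first is second     # True: one active session per JVM
    finally:
        first.stop()
```

A useful consequence: a test that accidentally builds its own session via `getOrCreate()` still receives the shared one rather than paying for a second JVM.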

When to Apply

This principle applies when:

  • Testing PySpark transformations that require a SparkSession
  • Running a test suite with multiple test functions or modules
  • Performance of the test suite is a concern (i.e., nearly always)
