

Implementation:Apache Spark Python Run Tests

From Leeroopedia


Source Repository: Apache Spark
Domains: Testing, Python
Last Updated: 2026-02-08 14:00 GMT

Overview

Parallel PySpark test runner that distributes test modules across worker threads using a priority queue.

Description

The python/run-tests.py script runs PySpark unit tests in parallel. It uses a TestRunner class that manages subprocess execution with PTY support for interactive debugging. Tests are classified as heavy (streaming, mllib, sql, ml, pandas) or light, and heavy tests are prioritized in the queue to minimize total execution time. The script also supports testing against multiple Python executables.

The script operates in the following stages:

  • Test discovery -- determines which PySpark modules to test based on command-line arguments or the full module list
  • Test classification -- categorizes modules into heavy and light based on known execution characteristics
  • Queue construction -- builds a priority queue with heavy tests at priority 0 and light tests at priority 100
  • Worker dispatch -- spawns N worker threads (controlled by --parallelism) that consume tests from the queue
  • Result collection -- gathers pass/fail results from each worker and reports overall status
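The queue construction and worker dispatch stages above can be sketched as follows. This is a minimal illustration of the priority scheme described (heavy tests at priority 0, light tests at priority 100), not the script's actual code; the module names and helper functions here are hypothetical.

```python
import queue
import threading

# Illustrative set of "heavy" modules; the real classification lives in the script.
HEAVY = {"pyspark-sql", "pyspark-ml"}

def build_queue(modules):
    """Heavy tests get priority 0 so workers pick them up first."""
    q = queue.PriorityQueue()
    for name in modules:
        priority = 0 if name in HEAVY else 100
        q.put((priority, name))
    return q

def worker(q, results, lock):
    """Consume tests from the queue until it is empty."""
    while True:
        try:
            _, name = q.get_nowait()
        except queue.Empty:
            return
        # The real script spawns a subprocess here; we just record the order.
        with lock:
            results.append(name)

def run(modules, parallelism=4):
    q = build_queue(modules)
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(q, results, lock))
               for _ in range(parallelism)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Running the longest tests first reduces the chance that a single heavy module is still executing after all other workers have drained the queue, which is what keeps total wall-clock time low.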

The TestRunner class (Lines 79-233) encapsulates a single test execution:

  • Manages subprocess lifecycle with configurable timeouts
  • Supports PTY-based execution for tests that require terminal interaction
  • Captures stdout/stderr for reporting
  • Handles cleanup on test failure or timeout
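A single test execution with a configurable timeout and captured output, as described above, can be sketched like this. This is not the actual TestRunner class; PTY handling and environment plumbing are omitted, and the function name is hypothetical.

```python
import subprocess
import sys

def run_one_test(cmd, env=None, timeout=None):
    """Run one test command; return (returncode, stdout, stderr).

    On timeout, subprocess.run kills the child and we report a failure.
    """
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            capture_output=True,  # capture stdout/stderr for reporting
            text=True,
            timeout=timeout,
        )
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return -1, "", "timed out after %ss" % timeout
```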

Usage

Use this script to run the PySpark test suite. It is commonly invoked from dev/run-tests.py, or directly for PySpark-focused development, and is the standard way to validate PySpark changes.

Code Reference

Source Repository: apache/spark
File: python/run-tests.py
Lines: 79-233 (TestRunner class), 453-549 (main function)
Signature: python3 python/run-tests.py [--modules=<list>] [--parallelism=N] [--python-executables=<list>] [--testnames=<names>]
Key Class: TestRunner(test_name, cmd, env, test_output, timeout=None)

I/O Contract

Inputs

  • --modules (str, optional) -- Comma-separated PySpark module names to test
  • --parallelism (int, optional) -- Number of worker threads (default 4)
  • --python-executables (str, optional) -- Comma-separated list of Python binary paths to test against
  • --testnames (str, optional) -- Specific test names to run
  • Compiled Spark + PySpark (implicit, required) -- Spark must be built and PySpark must be available
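The command-line flags above could be parsed roughly as follows. This is a hypothetical argparse sketch of the interface in the table; the real script's parser and defaults may differ.

```python
import argparse

def parse_opts(argv):
    """Parse the flags documented in the inputs table (illustrative only)."""
    parser = argparse.ArgumentParser(prog="python/run-tests.py")
    parser.add_argument("--modules", default=None,
                        help="comma-separated PySpark module names to test")
    parser.add_argument("--parallelism", type=int, default=4,
                        help="number of worker threads")
    parser.add_argument("--python-executables", dest="python_executables",
                        default=None,
                        help="comma-separated Python binary paths to test against")
    parser.add_argument("--testnames", default=None,
                        help="specific test names to run")
    return parser.parse_args(argv)
```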

Outputs

  • Test results per module -- Pass/fail status for each test module, printed to stdout
  • Exit code -- 0 on success, non-zero on failure
  • Optional PySpark sdist -- Built PySpark sdist package, if needed for testing

Usage Examples

Run all PySpark tests:

python3 python/run-tests.py

Run a specific PySpark module:

python3 python/run-tests.py --modules=pyspark-sql

Run with increased parallelism:

python3 python/run-tests.py --parallelism=8

Run with multiple Python executables:

python3 python/run-tests.py --python-executables=python3.9,python3.10

Run specific test names:

python3 python/run-tests.py --testnames="pyspark.sql.tests.test_dataframe"
