Implementation: Apache Spark Python Run Tests
| Field | Value |
|---|---|
| Source Repository | Apache Spark |
| Domains | Testing, Python |
| Last Updated | 2026-02-08 14:00 GMT |
Overview
Parallel PySpark test runner that distributes test modules across worker threads using a priority queue.
Description
The python/run-tests.py script runs PySpark unit tests in parallel. It uses a TestRunner class that manages subprocess execution with PTY support for interactive debugging. Tests are classified as heavy (streaming, mllib, sql, ml, pandas) or light, with heavy tests prioritized in the queue to minimize total execution time. The script also supports testing against multiple Python executables.
The script operates in the following stages:
- Test discovery -- determines which PySpark modules to test based on command-line arguments or the full module list
- Test classification -- categorizes modules into heavy and light based on known execution characteristics
- Queue construction -- builds a priority queue with heavy tests at priority 0 and light tests at priority 100
- Worker dispatch -- spawns N worker threads (controlled by --parallelism) that consume tests from the queue
- Result collection -- gathers pass/fail results from each worker and reports overall status
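The queue-construction and worker-dispatch stages above can be sketched with Python's standard `queue.PriorityQueue` and `threading` modules. This is a minimal illustration, not the script's actual code; the module names and the worker body (which only records dispatch order instead of spawning a subprocess) are hypothetical.

```python
import queue
import threading

HEAVY_PRIORITY, LIGHT_PRIORITY = 0, 100

# Hypothetical module lists for illustration only
heavy = ["pyspark-sql", "pyspark-ml"]
light = ["pyspark-core-misc"]

task_queue = queue.PriorityQueue()
# Tuples sort by priority first; the sequence number breaks ties
for i, mod in enumerate(heavy):
    task_queue.put((HEAVY_PRIORITY, i, mod))   # heavy tests dequeue first
for i, mod in enumerate(light):
    task_queue.put((LIGHT_PRIORITY, i, mod))

results = []
lock = threading.Lock()

def worker():
    while True:
        try:
            _prio, _seq, mod = task_queue.get_nowait()
        except queue.Empty:
            return
        # The real runner spawns a test subprocess here; we just record the module
        with lock:
            results.append(mod)
        task_queue.task_done()

# N worker threads, analogous to --parallelism (default 4)
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because heavy modules sit at priority 0, they are claimed by workers before any light module, so the longest-running tests start as early as possible.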
The TestRunner class (Lines 79-233) encapsulates a single test execution:
- Manages subprocess lifecycle with configurable timeouts
- Supports PTY-based execution for tests that require terminal interaction
- Captures stdout/stderr for reporting
- Handles cleanup on test failure or timeout
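The subprocess-lifecycle responsibilities listed above can be sketched as follows. This is a simplified, hypothetical version of the class (it omits PTY support and uses a different constructor shape than the real `TestRunner(test_name, cmd, env, test_output, timeout=None)`); it shows only the timeout-and-cleanup pattern using the standard `subprocess` module.

```python
import subprocess
import sys

class TestRunner:
    """Minimal sketch of a per-test runner (hypothetical simplification)."""

    def __init__(self, test_name, cmd, env=None, timeout=None):
        self.test_name = test_name
        self.cmd = cmd
        self.env = env          # None inherits the parent environment
        self.timeout = timeout  # seconds, or None for no limit
        self.returncode = None
        self.output = ""

    def run(self):
        proc = subprocess.Popen(
            self.cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # capture both streams together
            env=self.env,
            text=True,
        )
        try:
            self.output, _ = proc.communicate(timeout=self.timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # cleanup on timeout, then drain remaining output
            self.output, _ = proc.communicate()
        self.returncode = proc.returncode
        return self.returncode == 0

# Trivial self-test: run a Python one-liner as the "test" subprocess
runner = TestRunner("echo-test", [sys.executable, "-c", "print('ok')"], timeout=30)
passed = runner.run()
```

Killing the process on `TimeoutExpired` and then calling `communicate()` again is the standard way to avoid leaking zombie subprocesses while still collecting partial output for the report.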
Usage
Use this script to run the PySpark test suite. It is commonly invoked from dev/run-tests.py, or directly during PySpark-focused development. This script is the standard way to validate PySpark changes.
Code Reference
| Attribute | Details |
|---|---|
| Source | Repository apache/spark, File python/run-tests.py, Lines 79-233 (TestRunner class), Lines 453-549 (main function) |
| Signature | python3 python/run-tests.py [--modules=<list>] [--parallelism=N] [--python-executables=<list>] [--testnames=<names>] |
| Key Class | TestRunner(test_name, cmd, env, test_output, timeout=None) |
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| --modules | str | No | Comma-separated PySpark module names to test |
| --parallelism | int | No | Number of worker threads (default 4) |
| --python-executables | str | No | Comma-separated list of Python binary paths to test against |
| --testnames | str | No | Specific test names to run |
| Compiled Spark + PySpark | implicit | Yes | Spark must be built and PySpark must be available |
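The four flags above could be parsed with a standard-library argument parser along these lines. This is a hedged sketch using `argparse`; the actual script's parsing code and defaults may differ, and the sample argument values are hypothetical.

```python
import argparse

parser = argparse.ArgumentParser(description="Sketch of the run-tests.py CLI")
parser.add_argument("--modules", default=None,
                    help="Comma-separated PySpark module names to test")
parser.add_argument("--parallelism", type=int, default=4,
                    help="Number of worker threads")
parser.add_argument("--python-executables", default=None,
                    help="Comma-separated Python binary paths to test against")
parser.add_argument("--testnames", default=None,
                    help="Specific test names to run")

# Parse a hypothetical invocation instead of sys.argv for illustration
args = parser.parse_args(["--modules=pyspark-sql", "--parallelism=8"])
modules = args.modules.split(",") if args.modules else []
```

Comma-separated values are split into lists after parsing, so downstream code can iterate over modules and Python executables uniformly.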
Outputs
| Output | Description |
|---|---|
| Test results per module | Pass/fail status for each test module printed to stdout |
| Exit code | 0 on success, non-zero on failure |
| Optional PySpark sdist | Built PySpark sdist package if needed for testing |
Usage Examples
Run all PySpark tests:
python3 python/run-tests.py
Run a specific PySpark module:
python3 python/run-tests.py --modules=pyspark-sql
Run with increased parallelism:
python3 python/run-tests.py --parallelism=8
Run with multiple Python executables:
python3 python/run-tests.py --python-executables=python3.9,python3.10
Run specific test names:
python3 python/run-tests.py --testnames="pyspark.sql.tests.test_dataframe"