

Implementation:Apache Spark Python Run Tests

From Leeroopedia


Source Repository: Apache Spark
Domains: Testing, Python
Last Updated: 2026-02-08 14:00 GMT

Overview

Parallel PySpark test runner that distributes test modules across worker threads using a priority queue.

Description

The python/run-tests.py script runs PySpark unit tests in parallel. It uses a TestRunner class that manages subprocess execution with PTY support for interactive debugging. Tests are classified as heavy (streaming, mllib, sql, ml, pandas) or light, and heavy tests are prioritized in the queue to minimize total execution time. The script also supports testing against multiple Python executables.

The script operates in the following stages:

  • Test discovery -- determines which PySpark modules to test based on command-line arguments or the full module list
  • Test classification -- categorizes modules into heavy and light based on known execution characteristics
  • Queue construction -- builds a priority queue with heavy tests at priority 0 and light tests at priority 100
  • Worker dispatch -- spawns N worker threads (controlled by --parallelism) that consume tests from the queue
  • Result collection -- gathers pass/fail results from each worker and reports overall status
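The queue construction and worker dispatch stages above can be sketched as follows. This is a minimal illustration of the priority scheme described (heavy tests at priority 0, light tests at priority 100), not the script's actual code; the module names and helper functions here are hypothetical.

```python
import queue
import threading

# Illustrative set of "heavy" modules; the real classification lives in the script.
HEAVY = {"pyspark-sql", "pyspark-ml"}

def build_queue(modules):
    """Heavy tests get priority 0 so workers pick them up first."""
    q = queue.PriorityQueue()
    for name in modules:
        priority = 0 if name in HEAVY else 100
        q.put((priority, name))
    return q

def worker(q, results, lock):
    """Consume tests from the queue until it is empty."""
    while True:
        try:
            _, name = q.get_nowait()
        except queue.Empty:
            return
        # The real script spawns a subprocess here; we just record the order.
        with lock:
            results.append(name)

def run(modules, parallelism=4):
    q = build_queue(modules)
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(q, results, lock))
               for _ in range(parallelism)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Running the longest tests first reduces the chance that a single heavy module is still executing after all other workers have drained the queue, which is what keeps total wall-clock time low.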

The TestRunner class (Lines 79-233) encapsulates a single test execution:

  • Manages subprocess lifecycle with configurable timeouts
  • Supports PTY-based execution for tests that require terminal interaction
  • Captures stdout/stderr for reporting
  • Handles cleanup on test failure or timeout
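A single test execution with a configurable timeout and captured output, as described above, can be sketched like this. This is not the actual TestRunner class; PTY handling and environment plumbing are omitted, and the function name is hypothetical.

```python
import subprocess
import sys

def run_one_test(cmd, env=None, timeout=None):
    """Run one test command; return (returncode, stdout, stderr).

    On timeout, subprocess.run kills the child and we report a failure.
    """
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            capture_output=True,  # capture stdout/stderr for reporting
            text=True,
            timeout=timeout,
        )
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return -1, "", "timed out after %ss" % timeout
```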

Usage

Use this script to run the PySpark test suite. It is commonly invoked from dev/run-tests.py, or directly for PySpark-focused development, and is the standard way to validate PySpark changes.

Code Reference

Source Repository: apache/spark
File: python/run-tests.py
Lines: 79-233 (TestRunner class), 453-549 (main function)
Signature: python3 python/run-tests.py [--modules=<list>] [--parallelism=N] [--python-executables=<list>] [--testnames=<names>]
Key Class: TestRunner(test_name, cmd, env, test_output, timeout=None)

I/O Contract

Inputs

  • --modules (str, optional) -- Comma-separated PySpark module names to test
  • --parallelism (int, optional) -- Number of worker threads (default 4)
  • --python-executables (str, optional) -- Comma-separated list of Python binary paths to test against
  • --testnames (str, optional) -- Specific test names to run
  • Compiled Spark + PySpark (implicit, required) -- Spark must be built and PySpark must be available
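The command-line flags above could be parsed roughly as follows. This is a hypothetical argparse sketch of the interface in the table; the real script's parser and defaults may differ.

```python
import argparse

def parse_opts(argv):
    """Parse the flags documented in the inputs table (illustrative only)."""
    parser = argparse.ArgumentParser(prog="python/run-tests.py")
    parser.add_argument("--modules", default=None,
                        help="comma-separated PySpark module names to test")
    parser.add_argument("--parallelism", type=int, default=4,
                        help="number of worker threads")
    parser.add_argument("--python-executables", dest="python_executables",
                        default=None,
                        help="comma-separated Python binary paths to test against")
    parser.add_argument("--testnames", default=None,
                        help="specific test names to run")
    return parser.parse_args(argv)
```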

Outputs

  • Test results per module -- Pass/fail status for each test module, printed to stdout
  • Exit code -- 0 on success, non-zero on failure
  • Optional PySpark sdist -- Built PySpark sdist package, if needed for testing

Usage Examples

Run all PySpark tests:

python3 python/run-tests.py

Run a specific PySpark module:

python3 python/run-tests.py --modules=pyspark-sql

Run with increased parallelism:

python3 python/run-tests.py --parallelism=8

Run with multiple Python executables:

python3 python/run-tests.py --python-executables=python3.9,python3.10

Run specific test names:

python3 python/run-tests.py --testnames="pyspark.sql.tests.test_dataframe"
