
Implementation:Apache Spark Run Tests

From Leeroopedia


Field | Value
Source Repository | Apache Spark
Domains | Testing, CI_CD
Last Updated | 2026-02-08 14:00 GMT

Overview

The dev/run-tests.py script is the central CI/CD test orchestrator for running the full Apache Spark test suite across all supported languages.

Description

The dev/run-tests.py script is the main entry point for Spark's CI pipeline. It determines which modules to test based on git changes, runs style checks (Apache RAT, Scala, Java, Python, R), builds Spark, performs binary compatibility checks (MiMa), and executes test suites for Scala, Python, and R.

The script orchestrates the following stages in order:

  • License checking via Apache RAT to ensure all files carry proper license headers
  • Style checking for Scala, Java, Python, and R source files
  • Building Spark using Maven or SBT depending on configuration
  • Binary compatibility checking via MiMa (Migration Manager)
  • Module-based test execution for Scala/Java tests via SBT or Maven
  • PySpark test execution by delegating to python/run-tests.py
  • SparkR test execution by delegating to R/run-tests.sh
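The staged flow above can be sketched as a simple fail-fast pipeline. This is a hedged illustration only: the stage functions below are hypothetical stubs, not Spark's actual code, which lives in dev/run-tests.py.

```python
# Minimal fail-fast orchestration sketch. Stage names mirror the list
# above; the callables are hypothetical stand-ins for the real stages.
def run_pipeline(stages):
    """Run each (name, stage) pair in order; stop at the first failure."""
    for name, stage in stages:
        if not stage():
            print(f"[error] stage failed: {name}")
            return 1
        print(f"[ok] {name}")
    return 0

stages = [
    ("license check (Apache RAT)", lambda: True),
    ("style checks", lambda: True),
    ("build (Maven or SBT)", lambda: True),
    ("MiMa compatibility check", lambda: True),
    ("module tests", lambda: True),
]
exit_code = run_pipeline(stages)  # 0 here, since every stub stage succeeds
```

The real script follows the same fail-fast principle: a failure in an early stage (for example, a style violation) aborts the run before any expensive test execution begins.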

The script uses the Module class defined in dev/sparktestsupport/modules.py to determine which modules are affected by a given set of file changes. It then applies a topological sort from dev/sparktestsupport/toposort.py to order the modules by their dependencies before executing their associated test commands.
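The dependency ordering can be illustrated with the standard library's graphlib; Spark ships its own toposort.py, so this is an analogous sketch with an invented dependency map, not Spark's real module graph.

```python
# Ordering modules so that dependencies are tested before dependents,
# using stdlib graphlib (Python 3.9+). The dependency map is illustrative.
from graphlib import TopologicalSorter

# Each module maps to the set of modules it depends on.
deps = {
    "core": set(),
    "catalyst": {"core"},
    "sql": {"catalyst"},
    "hive": {"sql"},
}

order = list(TopologicalSorter(deps).static_order())
# "core" always precedes "catalyst", which precedes "sql", then "hive".
```

A topological order guarantees that if a change touches a low-level module, its dependents are scheduled after it, so failures surface in the most fundamental module first.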

Usage

Use this script to run the CI test suite locally or in automated CI environments like GitHub Actions. It is the canonical way to validate Spark changes before submitting pull requests.

Code Reference

Attribute | Details
Source Repository | apache/spark, file dev/run-tests.py, lines 465-656 (main function)
Supporting Files | dev/sparktestsupport/modules.py (Module class, L27-48); dev/sparktestsupport/utils.py (determine_modules_for_files, L32-167); dev/sparktestsupport/toposort.py (toposort_flatten, L41-84)
Signature | python3 dev/run-tests.py [--modules=<list>] [--parallelism=N] [--excluded-tags=<tags>] [--included-tags=<tags>]
Import | N/A (standalone script)
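The signature above can be mirrored with a minimal argparse sketch. This is an assumption-laden reconstruction: run-tests.py's real parser may differ in defaults, help text, and validation.

```python
# Hedged sketch of an option parser matching the documented signature.
# Defaults and dest names are assumptions, not Spark's actual parser.
import argparse

def build_parser():
    p = argparse.ArgumentParser(prog="dev/run-tests.py")
    p.add_argument("--modules", help="comma-separated module names to test")
    p.add_argument("--parallelism", type=int, default=4,
                   help="number of parallel test processes")
    p.add_argument("--excluded-tags", dest="excluded_tags",
                   help="test tags to exclude")
    p.add_argument("--included-tags", dest="included_tags",
                   help="test tags to include")
    return p

args = build_parser().parse_args(["--modules=core,sql", "--parallelism=8"])
# args.modules == "core,sql", args.parallelism == 8
```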

I/O Contract

Inputs

Parameter | Type | Required | Description
--modules | str | No | Comma-separated module names to test
--parallelism | int | No | Number of parallel test processes (default 4)
--excluded-tags | str | No | Test tags to exclude from execution
--included-tags | str | No | Test tags to include in execution
Compiled Spark source | implicit | Yes | Spark must be built first (the script handles this)
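When --modules is not given, the script infers affected modules from changed file paths. The mapping below is a simplified sketch in the spirit of determine_modules_for_files; the path prefixes are illustrative, not Spark's actual rules.

```python
# Sketch: map changed file paths to test modules by path prefix.
# Prefixes and module names are hypothetical examples.
MODULE_PREFIXES = {
    "core/": "core",
    "sql/": "sql",
    "python/": "pyspark",
    "R/": "sparkr",
}

def modules_for_files(changed_files):
    """Return the set of modules whose source trees contain the files."""
    mods = set()
    for path in changed_files:
        for prefix, mod in MODULE_PREFIXES.items():
            if path.startswith(prefix):
                mods.add(mod)
    return mods

# A change under sql/ and python/ selects the "sql" and "pyspark" modules.
print(modules_for_files(["sql/x.scala", "python/y.py"]))
```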

Outputs

Output | Description
Test results | Pass/fail status per module, printed to stdout
Exit code | 0 on success; nonzero error codes defined in sparktestsupport/__init__.py
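A CI wrapper typically keys off the exit code alone. The snippet below is a generic sketch of invoking the script and propagating its exit status; the error-code constants themselves live in sparktestsupport/__init__.py and are not reproduced here.

```python
# Sketch: run a command and surface its exit code (0 means success).
import subprocess
import sys

def run_and_report(cmd):
    """Run cmd as a subprocess and return its exit code."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print(f"tests failed with exit code {result.returncode}",
              file=sys.stderr)
    return result.returncode

# Example invocation (commented out: requires a Spark checkout):
# sys.exit(run_and_report(["python3", "dev/run-tests.py", "--modules=core"]))
```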

Usage Examples

Run all tests:

python3 dev/run-tests.py

Run specific modules:

python3 dev/run-tests.py --modules=core,sql

Run with increased parallelism:

python3 dev/run-tests.py --parallelism=8

Run with tag filtering:

python3 dev/run-tests.py --excluded-tags=org.apache.spark.tags.SlowTest
