Workflow: Apache Spark Building and Testing
| Knowledge Sources | |
|---|---|
| Domains | Build_Systems, CI_CD, Testing |
| Last Updated | 2026-02-08 22:00 GMT |
Overview
End-to-end process for building Apache Spark from source and running the comprehensive test suite across all supported languages and modules.
Description
This workflow covers the complete build-and-test cycle for the Apache Spark project. Spark uses Apache Maven as its reference build, with SBT supported for day-to-day development. The build produces JVM artifacts (JARs), PySpark packages, and optionally SparkR packages. The test infrastructure uses a topologically sorted module dependency graph to determine which tests to run based on changed files, enabling selective and efficient testing across the multi-language codebase (Scala, Java, Python, R).
Usage
Execute this workflow when you need to build Spark from source for development, verify code changes via the test suite, create a runnable distribution, or prepare PySpark pip-installable packages. This applies to contributors making pull requests, release managers preparing builds, and developers setting up local development environments.
Execution Steps
Step 1: Environment Setup
Configure the build environment with the required toolchain. Spark requires Maven 3.9.12 and Java 17 or 21. Set Maven memory options to ensure the build has sufficient heap space and code cache. The bundled build/mvn script can automatically download Maven and Scala if not present.
Key considerations:
- Set MAVEN_OPTS with at least a 4 GB heap, a 64 MB thread stack, and an enlarged reserved code cache
- Java 17 or 21 is required; Scala 2.13 is the only supported Scala version since Spark 4.0
- The build/mvn wrapper handles Maven and Scala downloads automatically
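A minimal environment sketch, assuming a POSIX shell; the exact memory values follow the pattern Spark's build documentation recommends and can be tuned upward for larger machines:

```shell
# Give Maven a large heap, a deep thread stack (Scala compilation is
# recursion-heavy), and extra JIT code cache before invoking the build.
export MAVEN_OPTS="-Xss64m -Xmx4g -XX:ReservedCodeCacheSize=128m"

# The bundled wrapper downloads Maven and Scala on first use if missing:
# ./build/mvn --version
```

Setting these in your shell profile avoids out-of-memory and StackOverflowError failures midway through long compilations.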
Step 2: Source Compilation
Compile the Spark source code using Maven or SBT. The build supports multiple Maven profiles to enable optional features like YARN, Kubernetes, Hive, and Thrift Server support. The compilation produces JAR artifacts for all Spark modules.
Key considerations:
- Use -DskipTests for faster initial builds
- Enable profiles as needed: -Pyarn, -Pkubernetes, -Phive, -Phive-thriftserver
- SBT can be used for faster iterative compilation during development
- Submodules can be built individually using mvn -pl
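In practice the compilation commands might look like the following sketch, run from the Spark source root. The profile names are the ones listed above; `sql/core` is just an example module, and `-am` is standard Maven "also make" to include its upstream dependencies:

```shell
# Fast first build: common optional profiles enabled, tests skipped.
./build/mvn -Pyarn -Pkubernetes -Phive -Phive-thriftserver -DskipTests clean package

# Rebuild a single module (plus the modules it depends on) while iterating:
./build/mvn -pl sql/core -am -DskipTests package

# SBT alternative for faster incremental compilation during development:
./build/sbt -Phive package
```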
Step 3: Running Tests
Execute the test suite using the dev/run-tests.py orchestrator. This script determines which modules need testing based on changed files (via git diff), resolves module dependencies using topological sort, and runs tests in the correct order. It covers license checks (Apache RAT), Scala/Java unit tests, Python tests, and optionally R tests.
Key considerations:
- Tests should not be run as root
- The test runner auto-detects changed modules for selective testing
- PySpark tests require building with Hive support first
- The python/run-tests.py script runs PySpark tests in parallel
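A sketch of typical test invocations, assuming a prior build with `-Phive` so the PySpark suites can run; the module-selection flags shown on `python/run-tests` are assumptions to illustrate the idea, so check `--help` on your checkout:

```shell
# Full orchestrated suite: RAT license checks, Scala/Java, Python, optionally R.
# Do not run this as root.
./dev/run-tests

# Run only the PySpark SQL tests, with a bounded degree of parallelism:
./python/run-tests --modules=pyspark-sql --parallelism=2
```

When no module filters are given, the orchestrator inspects `git diff` output to decide which modules changed and walks their dependents in topological order.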
Step 4: Building Distribution
Create a self-contained binary distribution using dev/make-distribution.sh. This packages compiled Spark with all dependencies into a tarball suitable for deployment. It supports optional inclusion of PySpark pip packages and SparkR packages.
Key considerations:
- Use --pip flag to include PySpark pip package
- Use --tgz to create a compressed tarball
- Enable desired profiles matching the compilation step
- The distribution includes bin/, sbin/, jars/, and conf/ directories
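A hedged example of the distribution step; the flags mirror the usage shown in Spark's building guide, and `custom-spark` is a placeholder name that becomes part of the tarball filename:

```shell
# Build a deployable, self-contained tarball with a pip-installable
# PySpark package included. Profiles should match the compilation step.
./dev/make-distribution.sh --name custom-spark --pip --tgz \
    -Pyarn -Pkubernetes -Phive -Phive-thriftserver
```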
Step 5: PySpark Package Build
Build the PySpark pip-installable package from the compiled JARs. This step creates an sdist package that can be installed via pip. The package bundles the necessary Spark JARs alongside the Python source code.
Key considerations:
- Requires JARs from the compilation step to be present
- Build by running python packaging/classic/setup.py sdist from the python/ directory
- Alternatively, use make-distribution.sh with --pip flag
- Cannot directly pip install from the Python directory; must build sdist first
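A minimal sketch of the manual sdist route, assuming the JARs were already produced by the compilation step in the same checkout:

```shell
# From the root of an already-built Spark source tree:
cd python
python packaging/classic/setup.py sdist

# The resulting source tarball lands under python/dist/ and can then
# be installed with pip, e.g.:
# pip install dist/pyspark-*.tar.gz
```

Installing straight from the python/ directory does not work because the JARs are only bundled into the package during the sdist step.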