Workflow:Apache Spark Building and Testing

From Leeroopedia


Knowledge Sources
Domains Build_Systems, CI_CD, Testing
Last Updated 2026-02-08 22:00 GMT

Overview

End-to-end process for building Apache Spark from source and running the comprehensive test suite across all supported languages and modules.

Description

This workflow covers the complete build-and-test cycle for the Apache Spark project. Spark uses Apache Maven as its build of reference, with SBT supported for day-to-day development. The build produces JVM artifacts (JARs), PySpark packages, and optionally SparkR packages. The test infrastructure uses a topology-sorted module dependency graph to determine which tests to run based on changed files, enabling selective and efficient testing across the multi-language codebase (Scala, Java, Python, R).

Usage

Execute this workflow when you need to build Spark from source for development, verify code changes via the test suite, create a runnable distribution, or prepare PySpark pip-installable packages. This applies to contributors making pull requests, release managers preparing builds, and developers setting up local development environments.

Execution Steps

Step 1: Environment Setup

Configure the build environment with the required toolchain. Spark requires Maven 3.9.12 and Java 17 or 21. Set Maven memory options to ensure the build has sufficient heap space and code cache. The bundled build/mvn script can automatically download Maven and Scala if not present.

Key considerations:

  • Set MAVEN_OPTS with at least 4 GB heap, a 64 MB thread stack, and an enlarged code cache (e.g. -Xmx4g -Xss64m -XX:ReservedCodeCacheSize=128m)
  • Java 17 or 21 is required; Scala 2.13 is the only supported Scala version since Spark 4.0
  • The build/mvn wrapper handles Maven and Scala downloads automatically
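A minimal sketch of this setup step, assuming a Spark source checkout; the memory values follow the recommendations noted above:

```shell
# Sketch of environment setup for building Spark from source.
# Heap, stack, and code-cache sizes match the recommended minimums.
export MAVEN_OPTS="-Xss64m -Xmx4g -XX:ReservedCodeCacheSize=128m"

# Verify the toolchain; Spark 4.x expects Java 17 or 21.
java -version

# The bundled wrapper downloads Maven (and Scala) on first use if absent.
./build/mvn --version
```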

Step 2: Source Compilation

Compile the Spark source code using Maven or SBT. The build supports multiple Maven profiles to enable optional features like YARN, Kubernetes, Hive, and Thrift Server support. The compilation produces JAR artifacts for all Spark modules.

Key considerations:

  • Use -DskipTests for faster initial builds
  • Enable profiles as needed: -Pyarn, -Pkubernetes, -Phive, -Phive-thriftserver
  • SBT can be used for faster iterative compilation during development
  • Individual submodules can be built with Maven's -pl (project list) option
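The compilation variants above can be sketched as follows (run from the repository root; the submodule artifact id is one example among many):

```shell
# Full build with optional profiles enabled; -DskipTests speeds up the first pass.
./build/mvn -Pyarn -Pkubernetes -Phive -Phive-thriftserver -DskipTests clean package

# Faster iterative compilation with SBT during development.
./build/sbt package

# Build and locally install a single submodule with Maven's -pl option;
# the artifact id shown is the Scala 2.13 streaming module, as an example.
./build/mvn -pl :spark-streaming_2.13 clean install
```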

Step 3: Running Tests

Execute the test suite using the dev/run-tests.py orchestrator. This script determines which modules need testing based on changed files (via git diff), resolves module dependencies using topological sort, and runs tests in the correct order. It covers license checks (Apache RAT), Scala/Java unit tests, Python tests, and optionally R tests.

Key considerations:

  • Tests should not be run as root
  • The test runner auto-detects changed modules for selective testing
  • PySpark tests require building with Hive support first
  • The python/run-tests.py script runs PySpark tests in parallel
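As a sketch of the test invocations (assuming the build from Step 2, with Hive support if PySpark tests are needed; the module name passed to --modules is one example):

```shell
# Run the full orchestrated suite: license checks (Apache RAT),
# Scala/Java unit tests, Python tests, and optionally R tests.
# In a pull-request context it narrows the module list from the git diff.
./dev/run-tests

# Run only the PySpark tests, in parallel, for a selected module.
python/run-tests --modules=pyspark-sql
```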

Step 4: Building Distribution

Create a self-contained binary distribution using dev/make-distribution.sh. This packages compiled Spark with all dependencies into a tarball suitable for deployment. It supports optional inclusion of PySpark pip packages and SparkR packages.

Key considerations:

  • Use --pip flag to include PySpark pip package
  • Use --tgz to create a compressed tarball
  • Enable desired profiles matching the compilation step
  • The distribution includes bin/, sbin/, jars/, and conf/ directories
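A sketch of the distribution step; --name sets the suffix of the output tarball (the value here is illustrative), and the profiles should mirror those used during compilation:

```shell
# Create a deployable binary distribution as a compressed tarball,
# including the PySpark pip package.
./dev/make-distribution.sh --name custom-spark --pip --tgz \
  -Phive -Phive-thriftserver -Pyarn -Pkubernetes
```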

Step 5: PySpark Package Build

Build the PySpark pip-installable package from the compiled JARs. This step creates an sdist package that can be installed via pip. The package bundles the necessary Spark JARs alongside the Python source code.

Key considerations:

  • Requires JARs from the compilation step to be present
  • Build via python packaging/classic/setup.py sdist in the python/ directory
  • Alternatively, use make-distribution.sh with --pip flag
  • PySpark cannot be pip-installed directly from the python/ directory; the sdist must be built first
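The package build can be sketched as follows (assuming the compiled JARs from Step 2 are already present; the version embedded in the sdist filename will vary):

```shell
# Build the pip-installable sdist from the python/ directory.
cd python
python packaging/classic/setup.py sdist

# Install the resulting archive, which bundles the Spark JARs
# alongside the Python source.
pip install dist/pyspark-*.tar.gz
```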

Execution Diagram

GitHub URL

Workflow Repository