Principle:Apache Spark Release Artifact Build
| Domains | Release_Engineering, Packaging |
|---|---|
| Last Updated | 2026-02-08 12:00 GMT |
Overview
A multi-variant build process that generates signed, checksummed release artifacts for different platform configurations from a single tagged source.
Description
Official releases must produce multiple distribution variants (different Hadoop versions, with/without PySpark, with/without SparkR) along with cryptographic signatures and checksums for verification. The build matrix defines which combinations to produce. Each artifact gets a GPG signature (.asc) and SHA-512 checksum for integrity verification. Source tarballs are also produced for users who prefer building from source.
The Apache Spark release build process generates a comprehensive set of artifacts:
- Source tarball: The complete source code at the tagged version, allowing users to build from source with their own configurations.
- Binary distributions: Pre-built distributions for each variant in the build matrix, typically including different Hadoop compatibility profiles and Scala versions.
- PySpark packages: Python pip-installable packages for PyPI distribution.
- SparkR packages: CRAN-compatible R packages for the SparkR bindings.
Each artifact undergoes cryptographic processing:
- GPG signature (`.asc`): Proves the artifact was produced by an authorized release manager.
- SHA-512 checksum (`.sha512`): Allows users to verify download integrity.
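The checksum half of this processing can be illustrated with a minimal Python sketch. It is not the release scripts' implementation: the function names are hypothetical, and the `<digest>  <filename>` layout of the `.sha512` file is an assumption made for the example.

```python
import hashlib
from pathlib import Path


def sha512_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(artifact: Path) -> Path:
    """Write '<digest>  <filename>' next to the artifact (assumed layout)."""
    out = artifact.with_name(artifact.name + ".sha512")
    out.write_text(f"{sha512_hex(artifact)}  {artifact.name}\n")
    return out


def verify_checksum(artifact: Path, checksum_file: Path) -> bool:
    """Compare the recorded digest with a freshly computed one."""
    recorded = checksum_file.read_text().split()[0].lower()
    return recorded == sha512_hex(artifact)
```

Signature verification is outside this sketch; it is typically done with `gpg --verify <artifact>.asc <artifact>` after importing the release managers' public keys.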
The build matrix is defined within the release scripts, specifying which Maven profiles, build flags, and Scala versions to use for each variant. This ensures that the full set of supported configurations is produced consistently for every release.
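One way to picture such a matrix is as a list of records, one per variant. The sketch below is an illustration only: the `Variant` type, the `artifact_name` scheme, and every value in `BUILD_MATRIX` are assumptions chosen for the example, not copied from the release scripts.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Variant:
    name: str                             # distribution name
    maven_flags: Tuple[str, ...]          # Maven profiles and flags
    build_package_flags: Tuple[str, ...]  # packaging options (e.g. Python, R)
    scala_version: str


# Illustrative values only; not copied from the release scripts.
BUILD_MATRIX = (
    Variant("hadoop3", ("-Phadoop-3",), ("--pip", "--r"), "2.13"),
    Variant("without-hadoop", ("-Phadoop-provided",), (), "2.13"),
)


def artifact_name(version: str, variant: Variant) -> str:
    """Hypothetical naming scheme for a binary distribution tarball."""
    return f"spark-{version}-bin-{variant.name}.tgz"


def describe(version: str, variant: Variant) -> str:
    """Summarize one matrix row as a single human-readable line."""
    flags = " ".join(variant.maven_flags + variant.build_package_flags)
    name = artifact_name(version, variant)
    return f"{name}: scala={variant.scala_version} flags=[{flags}]"
```

Keeping the matrix as data means that adding a variant is a one-line change, and the full list can be reviewed at a glance before a release.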
Usage
Run this step after tagging the release to produce all artifacts for the release candidate (RC) vote. The build is typically the most time-consuming part of the release process, because it compiles the project once per variant with a different configuration each time.
Theoretical Basis
The build follows a matrix expansion model:
    for each variant in matrix:
        build(source, variant_profiles) -> sign(gpg_key) -> checksum(sha512) -> upload(staging_repo)
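The expansion above can be sketched in Python with placeholder stages. Nothing here compiles, signs, or uploads anything: each stage only returns the name of the file a real step would produce, and all function names and the naming scheme are assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def build(source_tag: str, variant: str) -> str:
    """Placeholder: name of the artifact a real build would emit."""
    return f"{source_tag}-bin-{variant}.tgz"


def sign(artifact: str) -> str:
    """Placeholder: name of the detached GPG signature file."""
    return artifact + ".asc"


def checksum(artifact: str) -> str:
    """Placeholder: name of the SHA-512 checksum file."""
    return artifact + ".sha512"


def expand(source_tag: str, variants: Iterable[str]) -> Dict[str, List[str]]:
    """Return, per variant, the files that would be uploaded to staging."""
    staged: Dict[str, List[str]] = {}
    for variant in variants:
        artifact = build(source_tag, variant)
        staged[variant] = [artifact, sign(artifact), checksum(artifact)]
    return staged
```

Each variant yields three files (artifact, signature, checksum), so a matrix of N variants stages 3N files in this model.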
The key properties of this approach are:
- Completeness: The build matrix ensures all supported platform configurations are covered in every release.
- Integrity: The GPG signature authenticates the release manager who produced the artifact, and the SHA-512 checksum lets users detect corrupted or altered downloads.
- Reproducibility: Building from a tagged commit with explicit Maven profiles pins the source and configuration, so each variant can be rebuilt from the same inputs.
- Flexibility: The `make_binary_release` function accepts parameterized inputs (distribution name, Maven flags, build package flags, Scala version), making it straightforward to add or modify build variants.
The separation of source and binary distributions serves two audiences: power users who want to customize their build, and operational users who want pre-built binaries. Both audiences benefit from the same cryptographic verification chain.