# Workflow: Apache Spark Release Process
| Knowledge Sources | |
|---|---|
| Domains | Release_Engineering, CI_CD, Build_Systems |
| Last Updated | 2026-02-08 22:00 GMT |
## Overview
End-to-end process for creating an official Apache Spark release candidate, from tagging the source tree through building binary distributions to publishing artifacts.
## Description
This workflow covers the complete Apache Spark release process, which is orchestrated through a Dockerized build environment for reproducibility. The process includes creating a release tag from a specified branch, building source and binary distributions with multiple Hadoop profiles, generating documentation, publishing Maven artifacts to a staging repository, and producing the final release artifacts. The release scripts support both full releases and dry-run modes for testing. A contributor list generator and LLMs.txt generator are also included as part of the release tooling.
## Usage
Execute this workflow when preparing a new Apache Spark release candidate for community voting. This is performed by the designated release manager and typically follows the Apache Software Foundation release process. It can also be used in dry-run mode to validate the build before an official release attempt.
## Execution Steps
### Step 1: Release Environment Setup
Set up the Dockerized release environment using dev/create-release/do-release-docker.sh. This script builds a "spark-rm" Docker image containing all required build tools and dependencies, ensuring a reproducible and isolated build environment. The environment requires a configured working directory for output artifacts.
Key considerations:
- The spark-rm Docker image is rebuilt as needed on each invocation
- A working directory must be specified with the -d flag
- The -n flag enables dry-run mode for testing without uploads
- Individual steps (tag, build, docs, publish, finalize) can be run separately with -s
- A custom JDK path can be specified with -j; the default is OpenJDK 17
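Assuming the flags described above (-d, -n, -s, -j; spellings taken from this document, not re-verified against the script), a dry-run invocation could be sketched as follows. The sketch only assembles and prints the command, since actually running it requires Docker and a Spark checkout:

```shell
# Sketch only: assembles the do-release-docker.sh command line described above.
# The JDK path and step name are illustrative assumptions, not verified output
# of the actual script.
WORKDIR=./spark-rm-work
mkdir -p "$WORKDIR"

cmd=(dev/create-release/do-release-docker.sh
     -d "$WORKDIR"                      # working directory for output artifacts
     -n                                 # dry-run: no remote uploads
     -s build                           # run only the build step
     -j /usr/lib/jvm/java-17-openjdk)   # custom JDK path (illustrative)

# Print rather than execute.
printf '%s ' "${cmd[@]}"; echo
```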
### Step 2: Release Tagging
Create a git release tag from the specified branch using dev/create-release/release-tag.sh. This step updates version numbers in POM files and other configuration, commits the version changes, and creates a signed git tag for the release candidate.
Key considerations:
- The tag follows the pattern vX.Y.Z-rcN
- Version numbers are updated across all POM files
- The tag is pushed to the ASF git repository
- Shared utility functions from release-util.sh handle configuration
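The tag naming convention can be checked mechanically. The helper below is hypothetical (it is not part of release-tag.sh); it only encodes the vX.Y.Z-rcN pattern described above:

```shell
# Hypothetical validator for release-candidate tag names; not part of the
# Spark release scripts, just the vX.Y.Z-rcN convention made explicit.
is_valid_rc_tag() {
  printf '%s' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+-rc[0-9]+$'
}

is_valid_rc_tag "v4.0.0-rc1" && echo "valid"   # matches the convention
is_valid_rc_tag "4.0.0" || echo "invalid"      # missing the leading v and rcN
```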
### Step 3: Binary Distribution Build
Build the source and binary distributions using dev/create-release/release-build.sh. This is the largest and most complex step, producing multiple distribution artifacts: source tarball, binary tarballs for different Hadoop versions, PySpark pip package, and SparkR CRAN package.
Key considerations:
- Builds are executed with Maven using multiple Hadoop profiles
- Binary distributions are created via dev/make-distribution.sh
- The PySpark pip package is built and included
- Source distributions include the complete source tree with ASF licensing
- GPG signatures and SHA checksums are generated for all artifacts
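A minimal sketch of the checksum part of this step (GPG signing omitted; the artifact name is a placeholder, and the real scripts may invoke a different checksum tool):

```shell
# Sketch: generate and verify a SHA-512 checksum file next to an artifact,
# analogous to what the release build does for each tarball. The artifact
# here is a placeholder file, not a real distribution.
artifact=spark-x.y.z-bin-placeholder.tgz
echo "placeholder distribution contents" > "$artifact"

sha512sum "$artifact" > "$artifact.sha512"   # write checksum file
sha512sum -c "$artifact.sha512"              # verify against the file on disk
```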
### Step 4: Documentation Generation
Build the complete Spark documentation site, including API documentation for Scala, Java, Python, and R. The documentation is generated using Jekyll with custom plugins for code examples, API doc cross-references, and version-specific content.
Key considerations:
- Jekyll-based site generation with custom Ruby plugins
- API docs generated for all four supported languages
- SQL documentation generated from Spark SQL function metadata
- Error documentation auto-generated from error class definitions
- Documentation is versioned per release
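The per-release versioning of the generated site can be sketched as below. The directory layout and version value are illustrative assumptions; the real build (Jekyll plus the per-language API doc generators) is driven by the release scripts:

```shell
# Sketch of versioned documentation output. In a real build, something like
#   (cd docs && bundle exec jekyll build)
# produces the site first; here we only model the one-directory-per-release layout.
SPARK_VERSION=4.0.0                    # illustrative version
docs_out="site/docs/${SPARK_VERSION}"  # assumed layout, one directory per release
mkdir -p "$docs_out"
echo "generated site would be copied to: $docs_out"
```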
### Step 5: Artifact Publishing
Publish Maven artifacts to the ASF staging repository and upload binary distributions to the ASF SVN distribution area. This step makes the release candidate available for community voting and verification.
Key considerations:
- Maven artifacts are published to a staging repository
- Binary distributions are uploaded to SVN
- The contributor list is generated from commits between the previous release and the current tag
- The finalize step closes the staging repository
- Dry-run mode skips all remote uploads
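The dry-run behavior can be illustrated with a small guard function. This is a hypothetical sketch of the pattern, not the actual script code, and the SVN URL is a placeholder:

```shell
# Hypothetical dry-run guard: remote actions are printed instead of executed
# when dry-run mode (the -n flag) is enabled.
DRY_RUN=1

run_remote() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "[dry-run] skipping: $*"
  else
    "$@"
  fi
}

# Placeholder upload command; the real target is the ASF SVN distribution area.
run_remote svn import spark-bin https://dist.example.org/dev/spark -m "upload RC"
```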
### Step 6: Release Finalization
Close the Maven staging repository and prepare the release vote email. The contributor list generator (generate-contributors.py) produces a list of all contributors between releases. The release artifacts are now ready for community voting.
Key considerations:
- The staging repository is closed to prevent further modifications
- Contributor list covers all commits between the previous and current release
- The LLMs.txt generator creates a machine-readable project summary
- Community voting follows ASF governance procedures
- After a successful vote, artifacts are promoted to the release area
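The core idea behind the contributor list (generate-contributors.py does considerably more, such as credit attribution) is listing unique commit authors between two tags. A self-contained sketch using a throwaway repository:

```shell
# Self-contained sketch: unique commit authors between two tags.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=Alice -c user.email=alice@example.com \
    commit -q --allow-empty -m "work before the previous release"
git tag v1.0.0
git -c user.name=Bob -c user.email=bob@example.com \
    commit -q --allow-empty -m "work in the new release"
git tag v1.1.0

# Authors of commits reachable from v1.1.0 but not from v1.0.0:
git log v1.0.0..v1.1.0 --format='%an' | sort -u   # prints: Bob
```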