Workflow:Apache Spark Release Process

From Leeroopedia


Knowledge Sources
Domains Release_Engineering, CI_CD, Build_Systems
Last Updated 2026-02-08 22:00 GMT

Overview

End-to-end process for creating an official Apache Spark release candidate, from tagging the source tree through building binary distributions to publishing artifacts.

Description

This workflow covers the complete Apache Spark release process, which is orchestrated through a Dockerized build environment for reproducibility. The process includes creating a release tag from a specified branch, building source and binary distributions with multiple Hadoop profiles, generating documentation, publishing Maven artifacts to a staging repository, and producing the final release artifacts. The release scripts support both full releases and a dry-run mode for testing. A contributor list generator and an LLMs.txt generator are also included as part of the release tooling.

Usage

Execute this workflow when preparing a new Apache Spark release candidate for community voting. It is performed by the designated release manager and typically follows the Apache Software Foundation (ASF) release process. It can also be run in dry-run mode to validate the build before an official release attempt.

Execution Steps

Step 1: Release Environment Setup

Set up the Dockerized release environment using dev/create-release/do-release-docker.sh. This script builds a "spark-rm" Docker image containing all required build tools and dependencies, ensuring a reproducible and isolated build environment. The environment requires a configured working directory for output artifacts.

Key considerations:

  • The Docker image is rebuilt as needed on each invocation
  • A working directory must be specified with the -d flag
  • The -n flag enables dry-run mode for testing without uploads
  • Individual steps (tag, build, docs, publish, finalize) can be run separately with -s
  • Custom JDK paths can be specified with -j; defaults to OpenJDK 17
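Putting the flags above together, a dry-run invocation of the release driver might look like the following sketch. The working directory and JDK paths are hypothetical; only the flags (-d, -n, -s, -j) come from the description above, and the command is echoed rather than executed since a real run needs the full release environment.

```shell
# Hedged sketch: assemble a dry-run invocation of do-release-docker.sh.
# Paths are hypothetical placeholders.
WORKDIR="$HOME/spark-release"               # output artifacts land here (-d)
JDK_HOME="/usr/lib/jvm/java-17-openjdk"     # custom JDK path (-j), hypothetical
CMD="dev/create-release/do-release-docker.sh -d $WORKDIR -n -s build -j $JDK_HOME"
echo "$CMD"                                 # echoed, not executed, in this sketch
```

Running only the `build` step with `-s` is useful when iterating on distribution problems without re-tagging.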

Step 2: Release Tagging

Create a git release tag from the specified branch using dev/create-release/release-tag.sh. This step updates version numbers in POM files and other configuration, commits the version changes, and creates a signed git tag for the release candidate.

Key considerations:

  • The tag follows the pattern vX.Y.Z-rcN
  • Version numbers are updated across all POM files
  • The tag is pushed to the ASF git repository
  • Shared utility functions from release-util.sh handle configuration
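The vX.Y.Z-rcN tag pattern above can be checked mechanically before tagging; a minimal sketch, using a hypothetical tag value:

```shell
# Validate a candidate tag against the vX.Y.Z-rcN pattern described above.
# The tag value is a placeholder, not a real release.
TAG="v4.1.0-rc2"
if echo "$TAG" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+-rc[0-9]+$'; then
  echo "tag OK: $TAG"
else
  echo "malformed tag: $TAG" >&2
  exit 1
fi
```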

Step 3: Binary Distribution Build

Build the source and binary distributions using dev/create-release/release-build.sh. This is the largest and most complex step, producing multiple distribution artifacts: source tarball, binary tarballs for different Hadoop versions, PySpark pip package, and SparkR CRAN package.

Key considerations:

  • Builds are executed with Maven using multiple Hadoop profiles
  • Binary distributions are created via dev/make-distribution.sh
  • The PySpark pip package is built and included
  • Source distributions include the complete source tree with ASF licensing
  • GPG signatures and SHA checksums are generated for all artifacts
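The checksum side of the last point can be demonstrated on a stand-in file; release-build.sh also produces GPG detach-signatures, which are omitted here because signing requires the release manager's key. The filename is a placeholder.

```shell
# Generate and verify a SHA-512 checksum for a stand-in artifact,
# mirroring what is done for each real tarball.
printf 'placeholder artifact\n' > spark-x.y.z-bin-hadoop3.tgz
sha512sum spark-x.y.z-bin-hadoop3.tgz > spark-x.y.z-bin-hadoop3.tgz.sha512
sha512sum -c spark-x.y.z-bin-hadoop3.tgz.sha512   # prints "...: OK" on success
```

Verifiers on the voting thread re-run exactly this `-c` check against the published .sha512 files.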

Step 4: Documentation Generation

Build the complete Spark documentation site, including API documentation for Scala, Java, Python, and R. The documentation is generated using Jekyll with custom plugins for code examples, API doc cross-references, and version-specific content.

Key considerations:

  • Jekyll-based site generation with custom Ruby plugins
  • API docs generated for all four supported languages
  • SQL documentation generated from Spark SQL function metadata
  • Error documentation auto-generated from error class definitions
  • Documentation is versioned per release
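Since the site is Jekyll-based, the core of this step reduces to a jekyll build run under bundler inside docs/. A hedged sketch; the SKIP_API environment variable is an assumption for illustration (a knob to bypass the slow API-doc generation), and the command is echoed rather than executed:

```shell
# Hedged sketch of the docs build entry point (not executed here).
export SKIP_API=1                       # assumed knob to skip API-doc generation
CMD="bundle exec jekyll build"
echo "(cd docs && SKIP_API=$SKIP_API $CMD)"
```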

Step 5: Artifact Publishing

Publish Maven artifacts to the ASF staging repository and upload binary distributions to the ASF SVN distribution area. This step makes the release candidate available for community voting and verification.

Key considerations:

  • Maven artifacts are published to a staging repository
  • Binary distributions are uploaded to SVN
  • Contributor list is generated between the previous release and this tag
  • The finalize step closes the staging repository
  • Dry-run mode skips all remote uploads
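The SVN upload targets the ASF dev dist area. A sketch of the shape of that interaction, with the commands echoed rather than executed since they require ASF committer credentials; the RC directory name is hypothetical:

```shell
# Hedged sketch: stage binary artifacts in the ASF dev dist area over SVN.
# The repository URL is the standard ASF dist location; the directory
# name is a placeholder.
RC_DIR="spark-4.1.0-rc2-bin"
echo "svn co --depth=empty https://dist.apache.org/repos/dist/dev/spark svn-dev"
echo "mkdir svn-dev/$RC_DIR && cp *.tgz* svn-dev/$RC_DIR/"
echo "svn add svn-dev/$RC_DIR && svn commit -m 'Apache Spark $RC_DIR'"
```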

Step 6: Release Finalization

Close the Maven staging repository and prepare the release vote email. The contributor list generator (generate-contributors.py) produces a list of all contributors between releases. The release artifacts are now ready for community voting.

Key considerations:

  • The staging repository is closed to prevent further modifications
  • Contributor list covers all commits between the previous and current release
  • The LLMs.txt generator creates a machine-readable project summary
  • Community voting follows ASF governance procedures
  • After a successful vote, artifacts are promoted to the release area
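The core idea behind generate-contributors.py, listing unique commit authors between two release tags, can be illustrated in a throwaway repository; the tag names and authors below are placeholders, and the real script does considerably more (name translation, JIRA lookups):

```shell
# Self-contained illustration: unique authors between two tags.
tmp=$(mktemp -d); cd "$tmp"
git init -q
git config user.email rm@example.org
git config user.name "Alice"
git commit -q --allow-empty -m "release commit"
git tag v1.0.0                                  # previous release (placeholder)
git config user.name "Bob"
git commit -q --allow-empty -m "post-release commit"
git tag v1.1.0                                  # current release (placeholder)
git log v1.0.0..v1.1.0 --pretty='%an' | sort -u # prints: Bob
```

The half-open range `v1.0.0..v1.1.0` is what makes the list cover only commits new in this release.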

Execution Diagram

GitHub URL

Workflow Repository