Principle:Apache Spark Release Documentation Generation

Domains	Release_Engineering, Documentation
Last Updated	2026-02-08 12:00 GMT

Overview

A multi-toolchain documentation build process that generates API documentation for all supported languages from a single coordinated build pipeline.

Description

A polyglot project needs a documentation tool for each language it supports. Spark's documentation pipeline orchestrates Jekyll (static site), Scaladoc and Javadoc (JVM API docs), Sphinx (Python API docs), pkgdown (R API docs), and SQL documentation generators (Python scripts whose output is rendered with mkdocs). Each tool has its own skip control, so a single toolchain can be regenerated without running the others. The output is a complete, self-contained documentation site.

The documentation generation process for Apache Spark coordinates multiple independent toolchains:

Toolchain	Target	Tool	Output
Jekyll	Site pages	Ruby/Jekyll	Static HTML site with guides and configuration reference
Scaladoc	Scala API	SBT unidoc	Scala API documentation under `api/scala/`
Javadoc	Java API	SBT unidoc	Java API documentation under `api/java/`
Sphinx	Python API	Python Sphinx	PySpark API documentation under `api/python/`
pkgdown	R API	R pkgdown	SparkR API documentation under `api/R/`
SQL docs	SQL reference	Custom Python scripts + mkdocs	SQL function reference under `api/sql/`
Error docs	Error reference	Custom scripts	Error code reference documentation

Each toolchain can be independently skipped using environment variables, enabling faster development iterations when only one language's documentation needs to be regenerated.
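The skip logic described above can be sketched as a small resolver that takes an environment mapping and reports which toolchains would still run. This is an illustrative model, not the pipeline's actual code: the function name `stages_to_run`, the stage labels, the grouping of Scaladoc and Javadoc under one variable, the set of toolchains covered by `SKIP_API`, and the convention that a variable counts as set when its value is `"1"` are all assumptions.

```python
# Assumed mapping from each skip variable to the toolchains it disables.
# Scaladoc and Javadoc share one variable here because both come from the
# same SBT unidoc run; treat this grouping as an assumption.
SKIP_VARS = {
    "SKIP_SCALADOC": ["scaladoc", "javadoc"],
    "SKIP_PYTHONDOC": ["sphinx"],
    "SKIP_RDOC": ["pkgdown"],
    "SKIP_SQLDOC": ["sql_docs"],
    "SKIP_ERRORDOC": ["error_docs"],
}

# Toolchains assumed to be covered by the blanket SKIP_API switch.
API_STAGES = {"scaladoc", "javadoc", "sphinx", "pkgdown", "sql_docs"}

# Hypothetical stage labels, listed in a fixed order for stable output.
ALL_STAGES = ["jekyll", "scaladoc", "javadoc", "sphinx",
              "pkgdown", "sql_docs", "error_docs"]


def stages_to_run(env):
    """Return the toolchains that would still run for the given environment."""
    skipped = set()
    if env.get("SKIP_API") == "1":
        skipped |= API_STAGES
    for var, stages in SKIP_VARS.items():
        if env.get(var) == "1":
            skipped.update(stages)
    return [stage for stage in ALL_STAGES if stage not in skipped]


# Example: keep only the site pages and the Python API docs.
python_only = {
    "SKIP_SCALADOC": "1",
    "SKIP_RDOC": "1",
    "SKIP_SQLDOC": "1",
    "SKIP_ERRORDOC": "1",
}
print(stages_to_run(python_only))
```

Keeping the resolver separate from the code that launches each tool makes it easy to check a skip combination before starting a long build.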

Usage

Use during the release process to generate the full documentation site, or selectively during development to preview documentation changes. Skip controls allow developers to focus on a single language's documentation without waiting for the full pipeline to complete.

Theoretical Basis

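The documentation build can be modeled as a parallel pipeline: the toolchains do not depend on one another, so they are written here as independent branches (`||`) whose outputs are merged into one site: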

jekyll(site_pages) || scaladoc(scala_api) || javadoc(java_api) || sphinx(python_api) || pkgdown(r_api) || sql_docs(sql_ref) -> merge(output_site)
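As a minimal sketch of this model, the snippet below runs placeholder stages concurrently and then merges their results into one site layout. The stage functions are stubs: a real stage would invoke the external tool. The names `run_stage` and `build_site` are hypothetical, and the use of a thread pool is an illustration of the model rather than a statement about how the actual build schedules its tools.

```python
from concurrent.futures import ThreadPoolExecutor

# Site-relative output location for each stage, taken from the table above.
# The empty string stands for the site root produced by Jekyll.
STAGE_OUTPUT = {
    "jekyll": "",
    "scaladoc": "api/scala/",
    "javadoc": "api/java/",
    "sphinx": "api/python/",
    "pkgdown": "api/R/",
    "sql_docs": "api/sql/",
}


def run_stage(name):
    """Placeholder stage: report where this toolchain's output would land."""
    return name, STAGE_OUTPUT[name]


def build_site(stage_names):
    """Run independent stages concurrently, then merge into one layout."""
    with ThreadPoolExecutor() as pool:
        # map() yields results in input order, so the merge is deterministic.
        results = list(pool.map(run_stage, stage_names))
    # Merge step: one entry per output location, keyed by site-relative path.
    return {subdir: name for name, subdir in results}


site = build_site(["jekyll", "scaladoc", "sphinx"])
print(site)
```

Because no stage reads another stage's output, the merge reduces to placing each result under its own subdirectory.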

The key design principles are:

Tool-per-language: Each programming language uses its native documentation tool, ensuring the generated docs follow language-specific conventions and standards.
Skip controls: Environment variables (SKIP_SCALADOC, SKIP_PYTHONDOC, SKIP_RDOC, SKIP_SQLDOC, SKIP_ERRORDOC, SKIP_API) allow selective generation, reducing build times during development.
Unified output: Despite using different tools, all documentation is merged into a single output directory (docs/_site/) with consistent navigation and structure.
Production mode: The PRODUCTION=1 flag enables production-specific optimizations and URL configurations for the final release site.
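The principles above suggest a small helper for assembling a build invocation. The sketch below assumes the site is built by running `bundle exec jekyll build` from the `docs/` directory; the helper name `jekyll_invocation` and its parameters are hypothetical. It only constructs the command and the environment overrides, and does not start a build.

```python
# Skip variables named in the design principles above.
KNOWN_SKIP_VARS = {
    "SKIP_SCALADOC", "SKIP_PYTHONDOC", "SKIP_RDOC",
    "SKIP_SQLDOC", "SKIP_ERRORDOC", "SKIP_API",
}


def jekyll_invocation(production=False, skip=()):
    """Return (argv, env_overrides) for a documentation build.

    Assumes the build entry point is `bundle exec jekyll build`.
    """
    overrides = {}
    for var in skip:
        if var not in KNOWN_SKIP_VARS:
            # Catch typos early instead of silently running a full build.
            raise ValueError(f"unknown skip variable: {var}")
        overrides[var] = "1"
    if production:
        overrides["PRODUCTION"] = "1"
    return ["bundle", "exec", "jekyll", "build"], overrides


# Development preview: site pages only, no API docs.
dev_argv, dev_env = jekyll_invocation(skip=("SKIP_API",))

# Release build: everything, with production settings.
rel_argv, rel_env = jekyll_invocation(production=True)

# To launch a build (not executed here), something like:
#   import os, subprocess
#   subprocess.run(rel_argv, cwd="docs", env={**os.environ, **rel_env}, check=True)
# The generated site would then be found under docs/_site/.
```

Returning overrides instead of a full environment keeps the helper free of side effects and leaves the caller to decide how to combine them with the current process environment.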

This architecture reflects the polyglot nature of Apache Spark, which provides APIs in Scala, Java, Python, R, and SQL. Each language community expects documentation in its native format, generated by its native tools.

Related Pages

Implemented By

Implementation:Apache_Spark_Build_Api_Docs
