Principle:Apache Spark Test Orchestration

Field	Value
Sources	https://github.com/apache/spark
Domains	Testing, CI_CD
Last Updated	2026-02-08 14:00 GMT

Overview

A CI/CD orchestration strategy that determines which test modules to execute based on changed files and module dependency graphs, enabling efficient selective testing of large-scale projects.

Description

In large monorepo projects, running the entire test suite for every change is prohibitively expensive. Test orchestration solves this by mapping changed files to affected modules using a dependency graph, then topologically sorting those modules to determine the minimal set of tests needed.

Apache Spark's test orchestrator uses a Module class that defines source file patterns, dependencies between modules, and associated test commands. When files change, the system traces which modules contain those files, then expands the set to include all transitive dependents.

The orchestration process follows these steps:

Detect changed files via git diff
Map each changed file to its owning module using glob patterns
Expand the affected set by walking the dependency graph to find all transitive dependents
Topologically sort the expanded module set to respect build ordering
Execute test commands for each module in the sorted order

This approach ensures that only the tests relevant to a given change are executed, dramatically reducing CI time for incremental changes while still guaranteeing correctness through transitive dependency tracking.

Usage

Use this principle when designing CI/CD pipelines for large multi-module projects where full test suite execution is too slow. It is particularly effective when the project has a well-defined module dependency graph and clear file-to-module mapping.

Conditions where this principle applies:

The project is a monorepo with multiple interdependent modules
Full test suite execution takes longer than the acceptable feedback loop
Modules have well-defined source file boundaries
A dependency graph between modules can be statically determined

Theoretical Basis

The core algorithm is a file-to-module mapping followed by transitive dependency expansion using topological sort. The process can be expressed in pseudocode:

changed_files = git_diff()
affected_modules = map_files_to_modules(changed_files)
expanded = transitive_closure(affected_modules, dependency_graph)
ordered = topological_sort(expanded)
run_tests(ordered)

The transitive closure step is critical: if module A depends on module B, and module B is affected by a change, then module A must also be tested because its behavior may have changed. The topological sort ensures that modules are tested in dependency order, so failures in foundational modules are detected before dependent modules are tested.

This is effectively a graph reachability problem on a directed acyclic graph (DAG), where the dependency edges point from dependent to dependency. The transitive closure walks these edges in reverse (from dependency to dependent) to find all affected modules.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment