Principle:Apache Spark Test Orchestration
| Field | Value |
|---|---|
| Sources | https://github.com/apache/spark |
| Domains | Testing, CI_CD |
| Last Updated | 2026-02-08 14:00 GMT |
Overview
A CI/CD orchestration strategy that determines which test modules to execute based on changed files and module dependency graphs, enabling efficient selective testing of large-scale projects.
Description
In large monorepo projects, running the entire test suite for every change is prohibitively expensive. Test orchestration solves this by mapping changed files to affected modules using a dependency graph, then topologically sorting those modules to determine the minimal set of tests needed.
Apache Spark's test orchestrator uses a Module class that defines source file patterns, dependencies between modules, and associated test commands. When files change, the system traces which modules contain those files, then expands the set to include all transitive dependents.
The orchestration process follows these steps:
- Detect changed files via
git diff - Map each changed file to its owning module using glob patterns
- Expand the affected set by walking the dependency graph to find all transitive dependents
- Topologically sort the expanded module set to respect build ordering
- Execute test commands for each module in the sorted order
This approach ensures that only the tests relevant to a given change are executed, dramatically reducing CI time for incremental changes while still guaranteeing correctness through transitive dependency tracking.
Usage
Use this principle when designing CI/CD pipelines for large multi-module projects where full test suite execution is too slow. It is particularly effective when the project has a well-defined module dependency graph and clear file-to-module mapping.
Conditions where this principle applies:
- The project is a monorepo with multiple interdependent modules
- Full test suite execution takes longer than the acceptable feedback loop
- Modules have well-defined source file boundaries
- A dependency graph between modules can be statically determined
Theoretical Basis
The core algorithm is a file-to-module mapping followed by transitive dependency expansion using topological sort. The process can be expressed in pseudocode:
changed_files = git_diff()
affected_modules = map_files_to_modules(changed_files)
expanded = transitive_closure(affected_modules, dependency_graph)
ordered = topological_sort(expanded)
run_tests(ordered)
The transitive closure step is critical: if module A depends on module B, and module B is affected by a change, then module A must also be tested because its behavior may have changed. The topological sort ensures that modules are tested in dependency order, so failures in foundational modules are detected before dependent modules are tested.
This is effectively a graph reachability problem on a directed acyclic graph (DAG), where the dependency edges point from dependent to dependency. The transitive closure walks these edges in reverse (from dependency to dependent) to find all affected modules.