Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Apache Spark Test Orchestration

From Leeroopedia


Field Value
Sources https://github.com/apache/spark
Domains Testing, CI_CD
Last Updated 2026-02-08 14:00 GMT

Overview

A CI/CD orchestration strategy that determines which test modules to execute based on changed files and module dependency graphs, enabling efficient selective testing of large-scale projects.

Description

In large monorepo projects, running the entire test suite for every change is prohibitively expensive. Test orchestration solves this by mapping changed files to affected modules using a dependency graph, then topologically sorting those modules to determine the minimal set of tests needed.

Apache Spark's test orchestrator uses a Module class that defines source file patterns, dependencies between modules, and associated test commands. When files change, the system traces which modules contain those files, then expands the set to include all transitive dependents.

The orchestration process follows these steps:

  • Detect changed files via git diff
  • Map each changed file to its owning module using glob patterns
  • Expand the affected set by walking the dependency graph to find all transitive dependents
  • Topologically sort the expanded module set to respect build ordering
  • Execute test commands for each module in the sorted order

This approach ensures that only the tests relevant to a given change are executed, dramatically reducing CI time for incremental changes while still guaranteeing correctness through transitive dependency tracking.

Usage

Use this principle when designing CI/CD pipelines for large multi-module projects where full test suite execution is too slow. It is particularly effective when the project has a well-defined module dependency graph and clear file-to-module mapping.

Conditions where this principle applies:

  • The project is a monorepo with multiple interdependent modules
  • Full test suite execution takes longer than the acceptable feedback loop
  • Modules have well-defined source file boundaries
  • A dependency graph between modules can be statically determined

Theoretical Basis

The core algorithm is a file-to-module mapping followed by transitive dependency expansion using topological sort. The process can be expressed in pseudocode:

changed_files = git_diff()
affected_modules = map_files_to_modules(changed_files)
expanded = transitive_closure(affected_modules, dependency_graph)
ordered = topological_sort(expanded)
run_tests(ordered)

The transitive closure step is critical: if module A depends on module B, and module B is affected by a change, then module A must also be tested because its behavior may have changed. The topological sort ensures that modules are tested in dependency order, so failures in foundational modules are detected before dependent modules are tested.

This is effectively a graph reachability problem on a directed acyclic graph (DAG), where the dependency edges point from dependent to dependency. The transitive closure walks these edges in reverse (from dependency to dependent) to find all affected modules.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment