Principle:Apache Hudi CI Parallel Test Orchestration
| Knowledge Sources | |
|---|---|
| Domains | CI_CD, Testing, Code_Coverage |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Strategy for distributing a large-scale test suite across multiple parallel CI jobs with post-execution coverage aggregation to minimize feedback latency.
Description
CI Parallel Test Orchestration addresses the challenge of running a comprehensive test suite (unit, functional, and integration tests) for a large multi-module project within practical time constraints. Rather than executing all tests sequentially in a single job, the test suite is partitioned into independent groups based on module boundaries and test types (Java vs Scala, functional vs non-functional, DML vs DDL). Each group runs as an independent CI job, allowing the CI system to execute them concurrently.
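The partitioning step can be sketched as a simple grouping operation. The snippet below is a minimal illustration, not the project's actual tooling: the `SuiteEntry` descriptor, the module names, and the grouping key are all hypothetical, chosen only to show how module, language, and test category together define one CI job.

```python
from collections import defaultdict
from typing import NamedTuple


class SuiteEntry(NamedTuple):
    """Hypothetical descriptor for one test class (not a real project type)."""
    name: str
    module: str     # Maven module the test lives in (illustrative names)
    language: str   # "java" or "scala"
    category: str   # "unit" or "functional"


def partition_tests(entries):
    """Group tests by (module, language, category); each group maps to one CI job."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry.module, entry.language, entry.category)].append(entry.name)
    return dict(groups)


entries = [
    SuiteEntry("TestA", "module-a", "java", "unit"),
    SuiteEntry("TestB", "module-a", "java", "functional"),
    SuiteEntry("TestC", "module-b", "scala", "unit"),
    SuiteEntry("TestD", "module-a", "java", "unit"),
]

jobs = partition_tests(entries)
for key, names in sorted(jobs.items()):
    print(key, names)
```

Every test lands in exactly one group, so the groups can be scheduled as separate jobs without running any test twice.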
A critical companion concern is coverage aggregation: when tests run across separate processes and jobs, each produces its own coverage execution data. A final dependent job collects all per-job coverage artifacts and merges them into a single unified report, ensuring the project maintains an accurate view of overall test coverage despite the distributed execution.
This principle also addresses the need to run certain test groups inside Docker containers when the tests require specific system-level dependencies (e.g., Docker-in-Docker for integration tests).
Usage
Apply this principle when:
- A project's full test suite exceeds the practical timeout for a single CI job (commonly 120 minutes)
- Tests can be partitioned by module, language, or test category without inter-dependencies
- Aggregated code coverage reporting is required despite distributed execution
- Some test groups need isolated container environments while others can run on the host agent
This is a common approach for large Apache projects and multi-module Maven builds, where sequential test execution would make CI feedback unacceptably slow.
Theoretical Basis
The core mechanism relies on three properties:
1. Test Independence: Unit and functional tests in separate modules have no runtime dependencies on each other, enabling safe parallel execution.
2. Partitioning Strategy: Tests are divided along natural boundaries:
- Module boundary: Each Maven module's tests run independently
- Test type: Unit tests (-Punit-tests) vs functional tests (-Pfunctional-tests)
- Language boundary: Java tests vs Scala test suites
- Package boundary: Functional package tests vs non-functional package tests
3. Coverage Merge Associativity: JaCoCo execution data files (.exec) can be merged in any order and in any grouping, because the merge operation is commutative as well as associative:
# Pseudo-code for coverage aggregation
# Each job_i produces: exec_i = coverage(tests_i)
# Final coverage = merge(exec_1, exec_2, ..., exec_n)
# Property: merge is order-independent and associative
final_exec = merge(per_job_execs)
report = generate_report(final_exec, source_files, class_files)
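To make the order-independence property concrete, the following sketch uses a simplified model of execution data. It assumes that each job's data can be represented as a mapping from a class identifier to a tuple of per-probe hit flags, and that merging combines flags for the same class with a logical OR. This is an illustrative model only, not JaCoCo's file format or API; the `merge` function and the sample data are hypothetical.

```python
from functools import reduce
from itertools import permutations

# Simplified model (an assumption, not JaCoCo's .exec format):
# execution data maps a class id to a tuple of booleans, one per probe.


def merge(a, b):
    """Combine two execution-data dicts; probes for the same class are OR-ed."""
    merged = dict(a)
    for class_id, probes in b.items():
        if class_id in merged:
            merged[class_id] = tuple(x or y for x, y in zip(merged[class_id], probes))
        else:
            merged[class_id] = probes
    return merged


# Hypothetical per-job results.
exec_1 = {"Foo": (True, False, False)}
exec_2 = {"Foo": (False, True, False), "Bar": (True,)}
exec_3 = {"Bar": (False,), "Baz": (True, True)}

# Commutativity: every merge order yields the same combined data.
results = [reduce(merge, order, {}) for order in permutations([exec_1, exec_2, exec_3])]
all_equal = all(r == results[0] for r in results)

# Associativity: the grouping of merges does not matter either.
left = merge(merge(exec_1, exec_2), exec_3)
right = merge(exec_1, merge(exec_2, exec_3))

final_exec = results[0]
print(all_equal, left == right)
```

Because the result does not depend on order or grouping, the aggregation job can consume per-job artifacts in whatever order they are downloaded.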
The pipeline DAG follows a fan-out/fan-in pattern:
# Abstract pipeline structure
jobs = partition(test_suite, N) # Fan-out: split the suite into N independent jobs
results = parallel_execute(jobs) # All run concurrently
coverage = aggregate(results) # Fan-in: single merge job
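The fan-out/fan-in structure above can be simulated locally with a thread pool. This is a minimal sketch under stated assumptions: `partition` uses a round-robin split, `run_job` is a stand-in that returns a set of covered items instead of running real tests, and `aggregate` is a plain set union. None of these helpers correspond to an actual CI system API.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(test_suite, n):
    """Round-robin split of the suite into n groups (hypothetical strategy)."""
    return [test_suite[i::n] for i in range(n)]


def run_job(tests):
    """Stand-in for one CI job: 'runs' its tests and returns what they covered."""
    return {f"covered:{t}" for t in tests}


def aggregate(results):
    """Fan-in: union the per-job coverage sets into a single set."""
    combined = set()
    for result in results:
        combined |= result
    return combined


test_suite = [f"test_{i}" for i in range(10)]

jobs = partition(test_suite, 3)                    # Fan-out: 3 independent jobs
with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    results = list(pool.map(run_job, jobs))        # Jobs run concurrently
coverage = aggregate(results)                      # Fan-in: single merge step

print(len(jobs), len(coverage))
```

The merge step starts only after every job has returned, which mirrors a final CI job that declares a dependency on all test jobs.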