Principle:Apache Hudi CI Parallel Test Orchestration

Knowledge Sources	Azure Pipelines Parallelism, JaCoCo Coverage Merge
Domains	CI_CD, Testing, Code_Coverage
Last Updated	2026-02-08 00:00 GMT

Overview

A strategy for distributing a large test suite across multiple parallel CI jobs to minimize feedback latency, with coverage data aggregated in a final step after all jobs have run.

Description

CI Parallel Test Orchestration addresses the challenge of running a comprehensive test suite (unit, functional, and integration tests) for a large multi-module project within practical time constraints. Rather than executing all tests sequentially in a single job, the test suite is partitioned into independent groups based on module boundaries and test types (Java vs Scala, functional vs non-functional, DML vs DDL). Each group runs as an independent CI job, allowing the CI system to execute them concurrently.

A critical companion concern is coverage aggregation: when tests run across separate processes and jobs, each produces its own coverage execution data. A final dependent job collects all per-job coverage artifacts and merges them into a single unified report, ensuring the project maintains an accurate view of overall test coverage despite the distributed execution.

This principle also addresses the need to run certain test groups inside Docker containers when the tests require specific system-level dependencies (e.g., Docker-in-Docker for integration tests).

Usage

Apply this principle when:

A project's full test suite exceeds the practical timeout for a single CI job (commonly 120 minutes)
Tests can be partitioned by module, language, or test category without inter-dependencies
Aggregated code coverage reporting is required despite distributed execution
Some test groups need isolated container environments while others can run on the host agent
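As a rough illustration of the first two conditions, the sketch below packs independent test groups into the smallest number of jobs whose estimated runtime stays under a timeout, using a greedy longest-first heuristic. The group names, the per-group durations, and the helper names `balance` and `plan` are hypothetical and exist only for this example; they are not taken from the Hudi pipeline, which partitions along module and test-type boundaries rather than by measured duration.

```python
import heapq
import math


def balance(durations, n_jobs):
    """Assign each group, longest first, to the currently least-loaded job."""
    # Heap entries are (load in minutes, job index, group names).
    # The unique job index breaks ties, so the lists are never compared.
    heap = [(0, i, []) for i in range(n_jobs)]
    heapq.heapify(heap)
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, minutes in ordered:
        load, idx, groups = heapq.heappop(heap)
        groups.append(name)
        heapq.heappush(heap, (load + minutes, idx, groups))
    return sorted(heap, key=lambda job: job[1])


def plan(durations, timeout_minutes):
    """Return the first balanced plan in which every job fits the timeout."""
    if max(durations.values()) > timeout_minutes:
        raise ValueError("a single group exceeds the timeout; split it further")
    # Lower bound on the job count, then grow until the plan fits.
    n_jobs = max(1, math.ceil(sum(durations.values()) / timeout_minutes))
    while True:
        jobs = balance(durations, n_jobs)
        if max(load for load, _, _ in jobs) <= timeout_minutes:
            return jobs
        n_jobs += 1


# Hypothetical estimated durations in minutes (assumed values).
example_durations = {
    "module-a-unit": 70,
    "module-a-functional": 90,
    "module-b-unit": 40,
    "module-b-scala": 60,
    "module-c-unit": 30,
}
jobs = plan(example_durations, timeout_minutes=120)
for load, idx, groups in jobs:
    print(f"job {idx}: {load} min -> {groups}")
```

The heuristic does not guarantee an optimal packing; it only shows how a timeout budget translates into a job count once groups are known to be independent.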

This is a common approach for large multi-module Maven builds, including large Apache projects, where sequential test execution would make CI feedback loops unacceptably slow.

Theoretical Basis

The core mechanism relies on three properties:

1. Test Independence: Unit and functional tests in separate modules have no runtime dependencies on each other, enabling safe parallel execution.

2. Partitioning Strategy: Tests are divided along natural boundaries:

Module boundary: Each Maven module's tests run independently
Test type: Unit tests (-Punit-tests) vs functional tests (-Pfunctional-tests)
Language boundary: Java tests vs Scala test suites
Package boundary: Functional package tests vs non-functional package tests

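3. Coverage Merge Associativity: JaCoCo execution data files (.exec) can be merged in any order and in any grouping. The merge operation is both commutative and associative, so the final result does not depend on which job finishes first: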

# Pseudo-code for coverage aggregation
# Each job_i produces: exec_i = coverage(tests_i)
# Final coverage = merge(exec_1, exec_2, ..., exec_n)
# Property: merge is order-independent and associative
final_exec = merge(per_job_execs)
report = generate_report(final_exec, source_files, class_files)
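The order-independence claim can be checked on a small model. The sketch below is not JaCoCo's API: it assumes, as a simplification, that execution data is a map from class name to a list of boolean probe flags, and that merging combines flags with a logical OR. The class names and the `merge` helper are hypothetical.

```python
from functools import reduce
from itertools import permutations


def merge(a, b):
    """Merge two execution-data maps by OR-ing probe flags per class."""
    merged = {cls: list(probes) for cls, probes in a.items()}
    for cls, probes in b.items():
        if cls not in merged:
            merged[cls] = list(probes)
            continue
        if len(merged[cls]) != len(probes):
            raise ValueError(f"probe count mismatch for {cls}")
        merged[cls] = [x or y for x, y in zip(merged[cls], probes)]
    return merged


# Hypothetical per-job execution data (assumed values).
exec_1 = {"org/example/Foo": [True, False, False]}
exec_2 = {"org/example/Foo": [False, True, False],
          "org/example/Bar": [True, False]}
exec_3 = {"org/example/Bar": [False, True]}

# Every merge order yields the same final data.
outcomes = [reduce(merge, order, {})
            for order in permutations([exec_1, exec_2, exec_3])]
final_exec = outcomes[0]
print(final_exec)
print(all(outcome == final_exec for outcome in outcomes))
```

Because OR is commutative and associative, the fan-in job in this model can merge artifacts in whatever order they were downloaded.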

The pipeline DAG follows a fan-out/fan-in pattern:

# Abstract pipeline structure
jobs = partition(test_suite, N)     # Fan-out: N parallel jobs
results = parallel_execute(jobs)    # All run concurrently
coverage = aggregate(results)       # Fan-in: single merge job
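The abstract structure above can be made concrete with a minimal, self-contained sketch. It uses threads as a stand-in for CI jobs, which in a real pipeline run on separate agents, and it models coverage as a set of covered items. The test names and the bodies of `partition`, `run_job`, and `aggregate` are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(test_suite, n):
    """Split the suite round-robin into n groups."""
    return [test_suite[i::n] for i in range(n)]


def run_job(tests):
    """Stand-in for one CI job: returns the coverage its tests produced."""
    return {f"covered:{name}" for name in tests}


def aggregate(results):
    """Fan-in: union the per-job coverage sets into one."""
    merged = set()
    for result in results:
        merged |= result
    return merged


# Hypothetical test names (assumed for the example).
test_suite = ["TestA", "TestB", "TestC", "TestD", "TestE", "TestF", "TestG"]

jobs = partition(test_suite, 3)                  # Fan-out: 3 groups
with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    results = list(pool.map(run_job, jobs))      # Groups run concurrently
coverage = aggregate(results)                    # Fan-in: single merge step

print(len(jobs), len(coverage))
```

The merge step only starts once every job has returned, which mirrors the dependency of the final coverage job on all test jobs.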

Related Pages

Implementation:Apache_Hudi_Azure_Pipelines_CI_Configuration
