Implementation:Apache Hudi Azure Pipelines CI Configuration
| Knowledge Sources | |
|---|---|
| Domains | CI_CD, Testing, Code_Coverage |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete CI pipeline definition that runs Apache Hudi's unit and functional test suites across 10 parallel Azure DevOps jobs and aggregates their JaCoCo code coverage into a single report.
Description
The azure-pipelines-20230430.yml file defines the primary continuous integration pipeline for the Apache Hudi project on Azure DevOps. It orchestrates 10 parallel test jobs (UT_FT_1 through UT_FT_10) that split the project's test suite across Hudi's core modules, including hadoop-common, spark-client, spark-datasource (Java and Scala tests), Hudi Streamer/utilities, and other common modules. The pipeline builds with Spark 3.5 and Flink 1.18 profiles using Scala 2.12. A final MergeAndPublishCoverage job aggregates JaCoCo execution data files from all 10 test jobs into a unified code coverage report.
Two jobs (UT_FT_7 and UT_FT_10) run inside Docker containers using the apachehudi/hudi-ci-bundle-validation-base image, while the remaining eight jobs run directly on the Azure-hosted Ubuntu 22.04 agent using the Maven@4 task.
Usage
This pipeline triggers automatically on all branch pushes. It is the primary quality gate for pull requests and commits to the Apache Hudi repository. Contributors should understand this configuration when:
- Debugging CI failures on specific test jobs
- Adding new modules that need test coverage
- Modifying test profiles or Maven build arguments
- Understanding which tests run in which parallel job
- Investigating code coverage gaps in the aggregated JaCoCo report
Code Reference
Source Location
- Repository: Apache_Hudi
- File: azure-pipelines-20230430.yml
- Lines: 1-599
Configuration Structure
# Top-level pipeline structure
trigger:
  branches:
    include:
      - '*'
pool:
  vmImage: 'ubuntu-22.04'
parameters:
  - name: job3456UTModules                      # Spark datasource modules for jobs 3-6
  - name: job10UTModules                        # Exclusion list for job 10 unit tests
  - name: job10FTModules                        # Exclusion list for job 10 functional tests
  - name: job6HudiSparkDdlOthersWildcardSuites  # Scala test suites for job 6
  - name: jacocoModules                         # Modules excluded from coverage aggregation
variables:
  BUILD_PROFILES: '-Dscala-2.12 -Dspark3.5 -Dflink1.18'
  PLUGIN_OPTS: '-Dcheckstyle.skip=true -Drat.skip=true ...'
  MVN_OPTS_INSTALL: '-T 3 -Phudi-platform-service -DskipTests ...'
  MVN_OPTS_TEST: '-fae -Pwarn-log ...'
stages:
  - stage: test
    jobs:
      - job: UT_FT_1 through UT_FT_10   # 10 parallel test jobs
      - job: MergeAndPublishCoverage    # Aggregation job (depends on all 10)
Import
# No import needed — this file is consumed by Azure DevOps automatically
# when placed at the repository root and configured as a pipeline.
# Reference in Azure DevOps project settings:
# Pipeline source: azure-pipelines-20230430.yml
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| trigger | Branch filter | Yes | Triggers on all branches via wildcard * |
| BUILD_PROFILES | Maven profiles | Yes | -Dscala-2.12 -Dspark3.5 -Dflink1.18 — selects Scala, Spark, and Flink versions |
| job3456UTModules | List of module paths | Yes | Spark datasource modules tested in jobs 3-6 |
| job10UTModules | List of exclusion patterns | Yes | Modules excluded from job 10 unit tests (tested elsewhere) |
| job10FTModules | List of exclusion patterns | Yes | Modules excluded from job 10 functional tests |
| jacocoModules | List of exclusion patterns | Yes | Packaging/example modules excluded from coverage aggregation |
| Docker registry | Container registry | Yes | apachehudi-docker-hub for jobs 7 and 10 |
Outputs
| Name | Type | Description |
|---|---|---|
| JUnit test results | XML files | Published via PublishTestResults or Maven JUnit publisher per job |
| JaCoCo execution data | .exec files | Per-job merged JaCoCo execution data published as build artifacts |
| Aggregated coverage report | XML + HTML | Final jacoco-report.xml and jacoco-html-report published by MergeAndPublishCoverage |
| Top 100 long-running tests | Console output | Sorted list of slowest test cases displayed per job |
Usage Examples
Job Distribution Overview
# Job 1: hadoop-common unit tests + spark-client unit/functional tests
# Job 2: hudi-spark functional tests (FTA)
# Job 3: spark-datasource Java unit tests (functional package)
# Job 4: spark-datasource Java unit tests (non-functional package)
# Job 5: spark-datasource Scala DML tests
# Job 6: spark-datasource Scala DDL & Others tests
# Job 7: Hudi Streamer unit tests + utilities functional tests (Docker)
# Job 8: spark-datasource Scala SQL features + DML insert + FTC tests
# Job 9: spark FTB functional tests
# Job 10: Common modules + remaining utilities (Docker)
# MergeAndPublishCoverage: Aggregates all .exec files into final report
Adding a New Module to CI
# To add a new module to the test pipeline:
# 1. If it should run in an existing job, add it to the appropriate parameter list
# 2. If it should be excluded from job 10, add it to job10UTModules/job10FTModules
parameters:
  - name: job10UTModules
    type: object
    default:
      - '!hudi-hadoop-common'
      - '!hudi-client/hudi-spark-client'
      # Add exclusion for your new module if tested elsewhere:
      - '!hudi-new-module'
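The `!`-prefixed entries above follow Maven's project-list exclusion syntax: `mvn -pl '!module'` builds everything except that module. How the pipeline turns the parameter list into a Maven argument is not shown here, so the sketch below only illustrates the idea. It assumes the entries are joined with commas into a single `-pl` value; the helper name `join_modules` is hypothetical.

```shell
# Hypothetical helper: join its arguments with commas, as Maven's -pl option expects.
join_modules() {
  (
    IFS=','
    # "$*" joins the positional parameters using the first character of IFS.
    printf '%s\n' "$*"
  )
}

# Illustrative exclusion list, including the new module from the example above.
pl_arg="$(join_modules '!hudi-hadoop-common' '!hudi-client/hudi-spark-client' '!hudi-new-module')"

# Print the Maven invocation that would skip these modules (not executed here).
echo "mvn test -pl ${pl_arg}"
```

Running the printed command locally is a quick way to check that a new exclusion entry matches a real module path before pushing a pipeline change.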
JaCoCo Coverage Pipeline
# Each test job runs these steps after tests complete:
# 1. Download JaCoCo CLI
./scripts/jacoco/download_jacoco.sh
# 2. Merge per-module .exec files into merged-jacoco.exec
./scripts/jacoco/merge_jacoco_exec_files.sh \
  jacoco-lib/lib/jacococli.jar $(Build.SourcesDirectory)
# 3. Publish as build artifact: merged-jacoco-{BuildId}-{JobNumber}
# Final aggregation job merges all per-job files:
./scripts/jacoco/merge_jacoco_job_files.sh \
  jacoco-lib/lib/jacococli.jar $(System.ArtifactsDirectory) $(Build.SourcesDirectory)
# Generate HTML+XML report:
./scripts/jacoco/generate_jacoco_coverage_report.sh \
jacoco-lib/lib/jacococli.jar $(Build.SourcesDirectory)
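The contents of the scripts under `scripts/jacoco/` are not reproduced here. As a rough sketch of what such scripts typically wrap, the JaCoCo command-line tool provides a `merge` command (several `.exec` files into one) and a `report` command (XML and HTML from an `.exec` file plus compiled classes). The function names, the argument order, and the `JAVA_CMD` override used for dry runs are assumptions made for illustration.

```shell
# Hypothetical wrapper around `jacococli.jar merge`.
# $1: path to jacococli.jar
# $2: destination .exec file
# remaining arguments: the .exec files to merge
jacoco_merge() {
  cli_jar="$1"
  dest_file="$2"
  shift 2
  # JAVA_CMD can be set to `echo` to print the command instead of running it.
  "${JAVA_CMD:-java}" -jar "$cli_jar" merge "$@" --destfile "$dest_file"
}

# Hypothetical wrapper around `jacococli.jar report`.
# $1: jacococli.jar   $2: merged .exec file   $3: compiled classes directory
# $4: sources directory   $5: XML output file   $6: HTML output directory
jacoco_report() {
  "${JAVA_CMD:-java}" -jar "$1" report "$2" \
    --classfiles "$3" --sourcefiles "$4" \
    --xml "$5" --html "$6"
}

# Example (paths are illustrative):
#   jacoco_merge jacoco-lib/lib/jacococli.jar merged-jacoco.exec a/target/jacoco.exec b/target/jacoco.exec
#   jacoco_report jacoco-lib/lib/jacococli.jar merged-jacoco.exec target/classes src/main/java \
#     jacoco-report.xml jacoco-html-report
```

Merging first at the job level and then across jobs keeps each published artifact small, since the final aggregation job only needs to download one `.exec` file per test job.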