Implementation:Apache Hudi Azure Pipelines CI Configuration

From Leeroopedia


Knowledge Sources
Domains: CI_CD, Testing, Code_Coverage
Last Updated: 2026-02-08 00:00 GMT

Overview

Concrete CI pipeline definition for running Apache Hudi's unit and functional test suites across 10 parallel Azure DevOps jobs with aggregated JaCoCo code coverage reporting.

Description

The azure-pipelines-20230430.yml file defines the primary continuous integration pipeline for the Apache Hudi project on Azure DevOps. It orchestrates 10 parallel test jobs (UT_FT_1 through UT_FT_10) that split the project's test suite across Hudi's core modules, including hadoop-common, spark-client, spark-datasource (Java and Scala tests), Hudi Streamer/utilities, and other common modules. The pipeline builds with Spark 3.5 and Flink 1.18 profiles using Scala 2.12. A final MergeAndPublishCoverage job aggregates JaCoCo execution data files from all 10 test jobs into a unified code coverage report.

Two jobs (UT_FT_7 and UT_FT_10) run inside Docker containers based on the apachehudi/hudi-ci-bundle-validation-base image. The remaining jobs run directly on the Azure-hosted Ubuntu 22.04 agent using the Maven@4 pipeline task (version 4 of the Azure Pipelines Maven task, not Apache Maven 4).

Usage

This pipeline triggers automatically on all branch pushes. It is the primary quality gate for pull requests and commits to the Apache Hudi repository. Contributors should understand this configuration when:

  • Debugging CI failures on specific test jobs
  • Adding new modules that need test coverage
  • Modifying test profiles or Maven build arguments
  • Understanding which tests run in which parallel job
  • Investigating code coverage gaps in the aggregated JaCoCo report

Code Reference

Source Location

Configuration Structure

# Top-level pipeline structure
trigger:
  branches:
    include:
      - '*'

pool:
  vmImage: 'ubuntu-22.04'

parameters:
  - name: job3456UTModules      # Spark datasource modules for jobs 3-6
  - name: job10UTModules         # Exclusion list for job 10 unit tests
  - name: job10FTModules         # Exclusion list for job 10 functional tests
  - name: job6HudiSparkDdlOthersWildcardSuites  # Scala test suites for job 6
  - name: jacocoModules          # Modules excluded from coverage aggregation

variables:
  BUILD_PROFILES: '-Dscala-2.12 -Dspark3.5 -Dflink1.18'
  PLUGIN_OPTS: '-Dcheckstyle.skip=true -Drat.skip=true ...'
  MVN_OPTS_INSTALL: '-T 3 -Phudi-platform-service -DskipTests ...'
  MVN_OPTS_TEST: '-fae -Pwarn-log ...'

stages:
  - stage: test
    jobs:
      # Ten parallel test jobs, UT_FT_1 through UT_FT_10 (only the first and last are shown)
      - job: UT_FT_1
      # ...
      - job: UT_FT_10
      - job: MergeAndPublishCoverage     # Aggregation job (depends on all 10)
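
The variables in the listing above are fragments of Maven command lines. The sketch below shows one plausible way they combine, which can help when reproducing a job locally. It only prints the commands. It uses just the flags visible in the listing (the elided "..." portions are left out), and the helper names install_cmd and test_cmd, the goal names, and the flag order are assumptions rather than the pipeline's actual step definitions.

```shell
# Values copied from the listing above; the elided ("...") parts are omitted.
BUILD_PROFILES='-Dscala-2.12 -Dspark3.5 -Dflink1.18'
MVN_OPTS_INSTALL='-T 3 -Phudi-platform-service -DskipTests'
MVN_OPTS_TEST='-fae -Pwarn-log'

# Hypothetical helper: print the build command (compile and install, no tests).
install_cmd() {
  echo "mvn clean install ${MVN_OPTS_INSTALL} ${BUILD_PROFILES}"
}

# Hypothetical helper: print a test command restricted to the module list in $1.
test_cmd() {
  echo "mvn test ${MVN_OPTS_TEST} ${BUILD_PROFILES} -pl $1"
}

install_cmd
test_cmd hudi-hadoop-common
```

The real jobs also pass profiles that select unit or functional tests. Those do not appear in the listing, so they are not shown here.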

Import

# No import needed — this file is consumed by Azure DevOps automatically
# when placed at the repository root and configured as a pipeline.
# Reference in Azure DevOps project settings:
#   Pipeline source: azure-pipelines-20230430.yml

I/O Contract

Inputs

Name | Type | Required | Description
trigger | Branch filter | Yes | Triggers on all branches via the wildcard *
BUILD_PROFILES | Maven profiles | Yes | -Dscala-2.12 -Dspark3.5 -Dflink1.18; selects the Scala, Spark, and Flink versions
job3456UTModules | List of module paths | Yes | Spark datasource modules tested in jobs 3-6
job10UTModules | List of exclusion patterns | Yes | Modules excluded from job 10 unit tests (tested elsewhere)
job10FTModules | List of exclusion patterns | Yes | Modules excluded from job 10 functional tests
jacocoModules | List of exclusion patterns | Yes | Packaging/example modules excluded from coverage aggregation
Docker registry | Container registry | Yes | apachehudi-docker-hub for jobs 7 and 10

Outputs

Name | Type | Description
JUnit test results | XML files | Published via PublishTestResults or the Maven JUnit publisher, per job
JaCoCo execution data | .exec files | Per-job merged JaCoCo execution data, published as build artifacts
Aggregated coverage report | XML + HTML | Final jacoco-report.xml and jacoco-html-report, published by MergeAndPublishCoverage
Top 100 long-running tests | Console output | Sorted list of the slowest test cases, displayed per job
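
The "Top 100 long-running tests" output amounts to a numeric sort in descending order followed by a cut-off. A minimal sketch of that idea follows. It assumes the per-test timings have already been extracted into "<seconds> <test name>" lines. The helper name top_slowest, that input format, and the sample test names are all hypothetical and are not taken from the pipeline.

```shell
# Hypothetical helper: keep the N slowest entries.
# $1    = number of entries to keep
# stdin = one "<seconds> <test name>" line per test case (an assumed format)
top_slowest() {
  # Sort numerically on the leading duration, largest first, then truncate.
  LC_ALL=C sort -rn | head -n "$1"
}

# Illustrative timings only, not real Hudi results.
printf '%s\n' \
  '1.5 TestA.testWrite' \
  '12.25 TestB.testCompaction' \
  '0.3 TestC.testRead' \
  | top_slowest 2
```

With a limit of 100, the same function would produce a per-job list of the kind described in the table.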

Usage Examples

Job Distribution Overview

# Job 1: hadoop-common unit tests + spark-client unit/functional tests
# Job 2: hudi-spark functional tests (FTA)
# Job 3: spark-datasource Java unit tests (functional package)
# Job 4: spark-datasource Java unit tests (non-functional package)
# Job 5: spark-datasource Scala DML tests
# Job 6: spark-datasource Scala DDL & Others tests
# Job 7: Hudi Streamer unit tests + utilities functional tests (Docker)
# Job 8: spark-datasource Scala SQL features + DML insert + FTC tests
# Job 9: spark FTB functional tests
# Job 10: Common modules + remaining utilities (Docker)
# MergeAndPublishCoverage: Aggregates all .exec files into final report

Adding a New Module to CI

# To add a new module to the test pipeline:
# 1. If it should run in an existing job, add to appropriate parameter list
# 2. If it should be excluded from job 10, add to job10UTModules/job10FTModules

parameters:
  - name: job10UTModules
    type: object
    default:
      - '!hudi-hadoop-common'
      - '!hudi-client/hudi-spark-client'
      # Add exclusion for your new module if tested elsewhere:
      - '!hudi-new-module'
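
The entries above use Maven's module selector syntax, in which a leading ! passed to -pl (--projects) excludes that module from the build. The sketch below shows how such a list could be turned into a single -pl argument. The helper name join_modules is hypothetical, and joining the entries with commas is an assumption about how the pipeline consumes the parameter.

```shell
# Hypothetical helper: join all arguments with commas, as Maven's -pl expects.
# The body runs in a subshell so the IFS change does not leak to the caller.
join_modules() (
  IFS=','
  printf '%s\n' "$*"
)

# Exclusion selectors from the parameter list above.
pl_arg="$(join_modules '!hudi-hadoop-common' '!hudi-client/hudi-spark-client' '!hudi-new-module')"

# Print the resulting invocation instead of running it.
echo "mvn test -pl '${pl_arg}'"
```

The single quotes around the selector list matter when typing the command into an interactive shell, where an unquoted ! can trigger history expansion.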

JaCoCo Coverage Pipeline

# Each test job runs these steps after tests complete:
# 1. Download JaCoCo CLI
./scripts/jacoco/download_jacoco.sh

# 2. Merge per-module .exec files into merged-jacoco.exec
./scripts/jacoco/merge_jacoco_exec_files.sh \
  jacoco-lib/lib/jacococli.jar $(Build.SourcesDirectory)

# 3. Publish as build artifact: merged-jacoco-{BuildId}-{JobNumber}

# Final aggregation job merges all per-job files:
./scripts/jacoco/merge_jacoco_job_files.sh \
  jacoco-lib/lib/jacococli.jar $(System.ArtifactsDirectory) $(Build.SourcesDirectory)

# Generate HTML+XML report:
./scripts/jacoco/generate_jacoco_coverage_report.sh \
  jacoco-lib/lib/jacococli.jar $(Build.SourcesDirectory)
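
The contents of the merge scripts are not shown on this page. As a rough sketch of what a merge step involves, the function below assembles a command line for the JaCoCo CLI's merge subcommand, which combines several .exec files into the file named by --destfile. It prints the command rather than running it, so no Java installation is needed to try it. The function name jacoco_merge_cmd and the module paths are hypothetical, and whether the Hudi scripts invoke the CLI in exactly this way is an assumption.

```shell
# Hypothetical helper: print a jacococli "merge" command.
# $1 = path to jacococli.jar, $2 = destination .exec file, rest = input .exec files
jacoco_merge_cmd() {
  cli_jar="$1"
  dest="$2"
  shift 2
  echo "java -jar ${cli_jar} merge $* --destfile ${dest}"
}

# Example with made-up module paths.
jacoco_merge_cmd jacoco-lib/lib/jacococli.jar merged-jacoco.exec \
  module-a/target/jacoco.exec module-b/target/jacoco.exec
```

The final aggregation job would apply the same kind of merge once more, this time over the per-job merged files downloaded as build artifacts.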
