Workflow: Apache Hudi Flink MOR Compaction
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Lake, Table_Management |
| Last Updated | 2026-02-08 20:00 GMT |
Overview
End-to-end process for compacting Merge-on-Read (MOR) Hudi tables in Apache Flink, merging delta log files into base Parquet files to optimize read performance.
Description
This workflow covers the asynchronous compaction process for MOR Hudi tables using the Flink compactor. MOR tables write updates as delta log files for fast ingestion, but these accumulate over time and degrade read performance. Compaction merges these log files with their corresponding base Parquet files to produce new, optimized base files. The Flink compactor can run inline (within the write pipeline), asynchronously (as a separate Flink job), or on a schedule. It uses configurable plan strategies to select which file groups to compact and supports both eager and lazy compaction modes.
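The file-group lifecycle described above can be sketched as a tiny model. This is illustrative Python, not the Hudi API: the `FileSlice` shape and naming scheme are simplified assumptions, but they capture the core transition of base file plus delta logs into one fresh base file.

```python
# Illustrative sketch (not Hudi's API): a MOR file group holds a base Parquet
# file plus accumulated delta log files; compaction replaces the combination
# with a single new base file written at a new instant time.
from dataclasses import dataclass, field

@dataclass
class FileSlice:
    base_file: str                                  # e.g. "fg1_001.parquet"
    log_files: list = field(default_factory=list)   # e.g. ["fg1_001.log.1"]

def compact(slice_: FileSlice, new_instant: str) -> FileSlice:
    """Merge base + logs into a fresh base file; the log list resets."""
    file_group = slice_.base_file.split("_")[0]     # hypothetical naming scheme
    return FileSlice(base_file=f"{file_group}_{new_instant}.parquet",
                     log_files=[])

before = FileSlice("fg1_001.parquet", ["fg1_001.log.1", "fg1_001.log.2"])
after = compact(before, "002")
print(after.base_file, after.log_files)  # fg1_002.parquet []
```

The old slice remains on storage for in-flight readers until the cleaner removes it (Step 5).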
Usage
Execute this workflow when a MOR Hudi table has accumulated delta log files that degrade read performance. Compaction folds the row-oriented log data back into optimized columnar Parquet base files. Run compaction regularly for tables with high update rates, or trigger it when read-optimized query latency becomes unacceptable.
Execution Steps
Step 1: Assess Compaction Need
Evaluate the current state of the MOR table to determine if compaction is needed. Check the number of pending log files per file group, the size of accumulated delta logs, and the number of uncompacted commits. The CompactionUtil provides utility methods to check if compaction should be scheduled based on configured thresholds.
Key considerations:
- Monitor the ratio of log file size to base file size per file group
- Track the number of delta commits since the last compaction
- Consider the impact on concurrent readers during compaction
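A threshold check in the spirit of these considerations can be sketched as follows. The function name, parameter names, and default thresholds are illustrative assumptions, not Hudi configuration keys; real deployments read these limits from table configuration.

```python
# Hypothetical trigger check mirroring what CompactionUtil-style logic
# evaluates: either the count of delta commits since the last compaction
# or the log-to-base size ratio crosses a configured threshold.
def should_schedule_compaction(delta_commits_since_last: int,
                               log_bytes: int,
                               base_bytes: int,
                               max_delta_commits: int = 5,
                               max_log_to_base_ratio: float = 0.5) -> bool:
    """base_bytes == 0 means a log-only file group, which always qualifies."""
    if delta_commits_since_last >= max_delta_commits:
        return True
    ratio = log_bytes / base_bytes if base_bytes else float("inf")
    return ratio >= max_log_to_base_ratio

print(should_schedule_compaction(3, 40 << 20, 100 << 20))  # False: 0.4 < 0.5
print(should_schedule_compaction(6, 10 << 20, 100 << 20))  # True: 6 >= 5 commits
```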
Step 2: Generate Compaction Plan
Create a compaction plan that identifies which file groups need compaction. The CompactionPlanStrategy selects file groups based on configurable criteria: number of log files, total log size, time since last compaction, or bounded I/O budget. The plan is persisted as a Hudi instant on the timeline for crash recovery.
Key considerations:
- BoundedIOCompactionStrategy limits the total I/O per compaction run
- LogFileSizeBasedCompactionStrategy prioritizes file groups with the most log data
- UnBoundedCompactionStrategy compacts all eligible file groups
- The plan is atomic and can be retried if the compaction job fails
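The selection logic combining these two ideas, largest log backlog first under an I/O cap, can be sketched as below. The cost model (read base + read logs + write new base) and the function shape are simplifications of what the real strategies compute, not Hudi code.

```python
# Sketch of plan selection in the spirit of LogFileSizeBasedCompactionStrategy
# combined with BoundedIOCompactionStrategy: rank file groups by log backlog,
# admit them until the estimated I/O budget is exhausted.
def select_file_groups(groups, io_budget_bytes):
    """groups: list of (file_group_id, base_bytes, log_bytes).
    Returns the plan as a list of file group ids."""
    ranked = sorted(groups, key=lambda g: g[2], reverse=True)  # most log data first
    plan, spent = [], 0
    for fg_id, base_bytes, log_bytes in ranked:
        # read base + logs, then write a new base of roughly the same size
        cost = 2 * (base_bytes + log_bytes)
        if spent + cost > io_budget_bytes:
            continue
        plan.append(fg_id)
        spent += cost
    return plan

groups = [("fg1", 100, 80), ("fg2", 100, 10), ("fg3", 50, 60)]
print(select_file_groups(groups, 600))  # ['fg1', 'fg3']
```

Persisting the resulting plan as a timeline instant (as the step above describes) is what makes the retry safe: a crashed compactor re-reads the same plan rather than re-deriving it.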
Step 3: Configure Flink Compactor
Set up the Flink compaction job with appropriate parallelism and resource allocation. Configure the compaction pipeline using the HoodieFlinkCompactor, which orchestrates the plan reading, file group compaction, and commit phases. For inline compaction, these phases are woven into the existing write pipeline.
Key considerations:
- Compaction parallelism affects resource usage and completion time
- Inline compaction adds latency to the write pipeline but avoids a separate job
- Async compaction runs independently but needs coordination with writers
- Configure memory limits to prevent OOM during large merges
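How parallelism bounds resource usage can be sketched by fanning the plan's file groups out over subtasks. This is a toy model using a stable hash; Flink's actual key-based partitioning differs in detail, and the function here is illustrative only.

```python
# Illustrative sketch: each file group in the compaction plan maps to one
# subtask, so parallelism caps how many merges run concurrently (and thus
# how much memory the job needs at once).
import zlib

def assign_subtasks(file_group_ids, parallelism):
    """Stable-hash each file group id to a subtask index (sketch only)."""
    buckets = {i: [] for i in range(parallelism)}
    for fg_id in file_group_ids:
        buckets[zlib.crc32(fg_id.encode()) % parallelism].append(fg_id)
    return buckets

print(assign_subtasks(["fg1", "fg2", "fg3", "fg4"], 2))
```

With parallelism 2, at most two file groups merge at a time; raising it shortens the run but multiplies peak memory, which is why the memory limits above matter.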
Step 4: Execute Compaction
Run the compaction job, which reads each file group's base Parquet file and delta log files, merges them using the configured merge strategy, and writes new base Parquet files. The CompactionCommitSink collects write results from all compaction tasks and commits the compaction instant atomically.
Key considerations:
- Each file group is compacted independently and in parallel
- The merge preserves the latest version of each record based on the precombine field
- New base files replace old base + log file combinations
- Failed compaction tasks can be retried without data loss
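The precombine-based merge semantics can be sketched as below. Records are modeled as plain dicts and the field names are assumptions for illustration; Hudi's payload classes layer more behavior (deletes, partial updates) on top of this core rule.

```python
# Sketch of merge semantics: for each record key, the version with the
# greater precombine value (e.g. an event timestamp) wins, so log records
# override base records only when they are at least as new. Deletes omitted.
def merge_records(base_records, log_records, precombine="ts"):
    """Each record is a dict with a 'key' plus payload fields."""
    merged = {r["key"]: r for r in base_records}
    for r in log_records:
        cur = merged.get(r["key"])
        if cur is None or r[precombine] >= cur[precombine]:
            merged[r["key"]] = r
    return sorted(merged.values(), key=lambda r: r["key"])

base = [{"key": "a", "ts": 1, "v": "old"},  {"key": "b", "ts": 1, "v": "keep"}]
logs = [{"key": "a", "ts": 2, "v": "new"},  {"key": "c", "ts": 1, "v": "insert"}]
print(merge_records(base, logs))
# "a" updated (ts 2 beats 1), "b" unchanged, "c" newly inserted
```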
Step 5: Commit and Clean
After all file groups are compacted, commit the compaction result to the Hudi timeline. This atomically transitions the compaction instant from inflight to completed. Post-compaction, the cleaner service removes old base files and log files that are no longer referenced by any active query or pending compaction.
Key considerations:
- The commit makes compacted data visible to new snapshot queries
- Cleaning is configured by retention policy (number of commits or time-based)
- Active readers using older file versions are not affected until they complete
- Verify the compaction succeeded by checking the timeline for completed instants
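A commits-retained cleaning policy, one of the retention styles mentioned above, can be sketched as follows. The function shape and the default window are illustrative, not Hudi configuration keys; the point is that only file slices outside the retained window of completed instants become eligible for deletion.

```python
# Sketch of commit-count retention: keep files referenced by the last N
# completed instants; everything older is safe to clean because no retained
# snapshot can still reference it.
def slices_to_clean(slices_by_instant, retain_commits=2):
    """slices_by_instant: {instant_time: [file names]}; instant times sort
    lexicographically in timeline order. Returns files eligible for removal."""
    instants = sorted(slices_by_instant)
    retained = set(instants[-retain_commits:])
    return [f for inst in instants if inst not in retained
            for f in slices_by_instant[inst]]

timeline = {"001": ["fg1_001.parquet"],
            "002": ["fg1_002.parquet"],
            "003": ["fg1_003.parquet"]}
print(slices_to_clean(timeline))  # ['fg1_001.parquet']
```

This is also why active readers are unaffected: a reader pinned to instant "002" still finds its files until "002" ages out of the retained window.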