Workflow: Apache Hudi Flink MOR Compaction
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Lake, Table_Management |
| Last Updated | 2026-02-08 20:00 GMT |
Overview
End-to-end process for compacting Merge-on-Read (MOR) Hudi tables in Apache Flink, merging delta log files into base Parquet files to optimize read performance.
Description
This workflow covers the asynchronous compaction process for MOR Hudi tables using the Flink compactor. MOR tables write updates as delta log files for fast ingestion, but these accumulate over time and degrade read performance. Compaction merges these log files with their corresponding base Parquet files to produce new, optimized base files. The Flink compactor can run inline (within the write pipeline), asynchronously (as a separate Flink job), or on a schedule. It uses configurable plan strategies to select which file groups to compact and supports both eager and lazy compaction modes.
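The file-group lifecycle described above can be sketched as a tiny model. This is illustrative Python, not the Hudi API: the `FileSlice` shape and naming scheme are simplified assumptions, but they capture the core transition of base file plus delta logs into one fresh base file.

```python
# Illustrative sketch (not Hudi's API): a MOR file group holds a base Parquet
# file plus accumulated delta log files; compaction replaces the combination
# with a single new base file written at a new instant time.
from dataclasses import dataclass, field

@dataclass
class FileSlice:
    base_file: str                                  # e.g. "fg1_001.parquet"
    log_files: list = field(default_factory=list)   # e.g. ["fg1_001.log.1"]

def compact(slice_: FileSlice, new_instant: str) -> FileSlice:
    """Merge base + logs into a fresh base file; the log list resets."""
    file_group = slice_.base_file.split("_")[0]     # hypothetical naming scheme
    return FileSlice(base_file=f"{file_group}_{new_instant}.parquet",
                     log_files=[])

before = FileSlice("fg1_001.parquet", ["fg1_001.log.1", "fg1_001.log.2"])
after = compact(before, "002")
print(after.base_file, after.log_files)  # fg1_002.parquet []
```

The old slice remains on storage for in-flight readers until the cleaner removes it (Step 5).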
Usage
Execute this workflow when a MOR Hudi table has accumulated delta log files that degrade read performance. Compaction folds the row-oriented log data back into optimized columnar Parquet base files. Run compaction regularly for tables with high update rates, or trigger it when read-optimized query latency becomes unacceptable.
Execution Steps
Step 1: Assess Compaction Need
Evaluate the current state of the MOR table to determine if compaction is needed. Check the number of pending log files per file group, the size of accumulated delta logs, and the number of uncompacted commits. The CompactionUtil provides utility methods to check if compaction should be scheduled based on configured thresholds.
Key considerations:
- Monitor the ratio of log file size to base file size per file group
- Track the number of delta commits since the last compaction
- Consider the impact on concurrent readers during compaction
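A threshold check in the spirit of these considerations can be sketched as follows. The function name, parameter names, and default thresholds are illustrative assumptions, not Hudi configuration keys; real deployments read these limits from table configuration.

```python
# Hypothetical trigger check mirroring what CompactionUtil-style logic
# evaluates: either the count of delta commits since the last compaction
# or the log-to-base size ratio crosses a configured threshold.
def should_schedule_compaction(delta_commits_since_last: int,
                               log_bytes: int,
                               base_bytes: int,
                               max_delta_commits: int = 5,
                               max_log_to_base_ratio: float = 0.5) -> bool:
    """base_bytes == 0 means a log-only file group, which always qualifies."""
    if delta_commits_since_last >= max_delta_commits:
        return True
    ratio = log_bytes / base_bytes if base_bytes else float("inf")
    return ratio >= max_log_to_base_ratio

print(should_schedule_compaction(3, 40 << 20, 100 << 20))  # False: 0.4 < 0.5
print(should_schedule_compaction(6, 10 << 20, 100 << 20))  # True: 6 >= 5 commits
```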
Step 2: Generate Compaction Plan
Create a compaction plan that identifies which file groups need compaction. The CompactionPlanStrategy selects file groups based on configurable criteria: number of log files, total log size, time since last compaction, or bounded I/O budget. The plan is persisted as a Hudi instant on the timeline for crash recovery.
Key considerations:
- BoundedIOCompactionStrategy limits the total I/O per compaction run
- LogFileSizeBasedCompactionStrategy prioritizes file groups with the most log data
- UnBoundedCompactionStrategy compacts all eligible file groups
- The plan is atomic and can be retried if the compaction job fails
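The selection logic combining these two ideas, largest log backlog first under an I/O cap, can be sketched as below. The cost model (read base + read logs + write new base) and the function shape are simplifications of what the real strategies compute, not Hudi code.

```python
# Sketch of plan selection in the spirit of LogFileSizeBasedCompactionStrategy
# combined with BoundedIOCompactionStrategy: rank file groups by log backlog,
# admit them until the estimated I/O budget is exhausted.
def select_file_groups(groups, io_budget_bytes):
    """groups: list of (file_group_id, base_bytes, log_bytes).
    Returns the plan as a list of file group ids."""
    ranked = sorted(groups, key=lambda g: g[2], reverse=True)  # most log data first
    plan, spent = [], 0
    for fg_id, base_bytes, log_bytes in ranked:
        # read base + logs, then write a new base of roughly the same size
        cost = 2 * (base_bytes + log_bytes)
        if spent + cost > io_budget_bytes:
            continue
        plan.append(fg_id)
        spent += cost
    return plan

groups = [("fg1", 100, 80), ("fg2", 100, 10), ("fg3", 50, 60)]
print(select_file_groups(groups, 600))  # ['fg1', 'fg3']
```

Persisting the resulting plan as a timeline instant (as the step above describes) is what makes the retry safe: a crashed compactor re-reads the same plan rather than re-deriving it.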
Step 3: Configure Flink Compactor
Set up the Flink compaction job with appropriate parallelism and resource allocation. Configure the compaction pipeline using the HoodieFlinkCompactor, which orchestrates the plan reading, file group compaction, and commit phases. For inline compaction, these phases are woven into the existing write pipeline.
Key considerations:
- Compaction parallelism affects resource usage and completion time
- Inline compaction adds latency to the write pipeline but avoids a separate job
- Async compaction runs independently but needs coordination with writers
- Configure memory limits to prevent OOM during large merges
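How parallelism bounds resource usage can be sketched by fanning the plan's file groups out over subtasks. This is a toy model using a stable hash; Flink's actual key-based partitioning differs in detail, and the function here is illustrative only.

```python
# Illustrative sketch: each file group in the compaction plan maps to one
# subtask, so parallelism caps how many merges run concurrently (and thus
# how much memory the job needs at once).
import zlib

def assign_subtasks(file_group_ids, parallelism):
    """Stable-hash each file group id to a subtask index (sketch only)."""
    buckets = {i: [] for i in range(parallelism)}
    for fg_id in file_group_ids:
        buckets[zlib.crc32(fg_id.encode()) % parallelism].append(fg_id)
    return buckets

print(assign_subtasks(["fg1", "fg2", "fg3", "fg4"], 2))
```

With parallelism 2, at most two file groups merge at a time; raising it shortens the run but multiplies peak memory, which is why the memory limits above matter.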
Step 4: Execute Compaction
Run the compaction job, which reads each file group's base Parquet file and delta log files, merges them using the configured merge strategy, and writes new base Parquet files. The CompactionCommitSink collects write results from all compaction tasks and commits the compaction instant atomically.
Key considerations:
- Each file group is compacted independently and in parallel
- The merge preserves the latest version of each record based on the precombine field
- New base files replace old base + log file combinations
- Failed compaction tasks can be retried without data loss
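The precombine-based merge semantics can be sketched as below. Records are modeled as plain dicts and the field names are assumptions for illustration; Hudi's payload classes layer more behavior (deletes, partial updates) on top of this core rule.

```python
# Sketch of merge semantics: for each record key, the version with the
# greater precombine value (e.g. an event timestamp) wins, so log records
# override base records only when they are at least as new. Deletes omitted.
def merge_records(base_records, log_records, precombine="ts"):
    """Each record is a dict with a 'key' plus payload fields."""
    merged = {r["key"]: r for r in base_records}
    for r in log_records:
        cur = merged.get(r["key"])
        if cur is None or r[precombine] >= cur[precombine]:
            merged[r["key"]] = r
    return sorted(merged.values(), key=lambda r: r["key"])

base = [{"key": "a", "ts": 1, "v": "old"},  {"key": "b", "ts": 1, "v": "keep"}]
logs = [{"key": "a", "ts": 2, "v": "new"},  {"key": "c", "ts": 1, "v": "insert"}]
print(merge_records(base, logs))
# "a" updated (ts 2 beats 1), "b" unchanged, "c" newly inserted
```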
Step 5: Commit and Clean
After all file groups are compacted, commit the compaction result to the Hudi timeline. This atomically transitions the compaction instant from inflight to completed. Post-compaction, the cleaner service removes old base files and log files that are no longer referenced by any active query or pending compaction.
Key considerations:
- The commit makes compacted data visible to new snapshot queries
- Cleaning is configured by retention policy (number of commits or time-based)
- Active readers using older file versions are not affected until they complete
- Verify the compaction succeeded by checking the timeline for completed instants
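A commits-retained cleaning policy, one of the retention styles mentioned above, can be sketched as follows. The function shape and the default window are illustrative, not Hudi configuration keys; the point is that only file slices outside the retained window of completed instants become eligible for deletion.

```python
# Sketch of commit-count retention: keep files referenced by the last N
# completed instants; everything older is safe to clean because no retained
# snapshot can still reference it.
def slices_to_clean(slices_by_instant, retain_commits=2):
    """slices_by_instant: {instant_time: [file names]}; instant times sort
    lexicographically in timeline order. Returns files eligible for removal."""
    instants = sorted(slices_by_instant)
    retained = set(instants[-retain_commits:])
    return [f for inst in instants if inst not in retained
            for f in slices_by_instant[inst]]

timeline = {"001": ["fg1_001.parquet"],
            "002": ["fg1_002.parquet"],
            "003": ["fg1_003.parquet"]}
print(slices_to_clean(timeline))  # ['fg1_001.parquet']
```

This is also why active readers are unaffected: a reader pinned to instant "002" still finds its files until "002" ages out of the retained window.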