Principle:Apache Hudi Compaction Plan Generation
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Stream_Processing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Selecting which pending compaction plans to execute from a set of candidates on the timeline using configurable filtering strategies.
Description
Once compaction has been scheduled on a Hudi MOR table, compaction plan instants accumulate on the active timeline in a requested state. A running compactor -- whether inline, async, or standalone -- must decide which of these pending plans to pick up for execution during any given compaction cycle. This decision is not trivial in production environments where multiple compaction plans may be pending simultaneously due to processing delays, backpressure, or operator restarts.
The Compaction Plan Generation principle addresses this selection problem through a strategy pattern. Rather than hardcoding a single selection approach, the system defines a pluggable interface that accepts the pending compaction timeline and returns an ordered subset of instants to compact. Three built-in strategies cover the most common operational needs:
- All -- Execute every pending compaction plan. This is the most aggressive approach, suitable for catch-up scenarios or batch environments where all outstanding work should be cleared.
- Specific instants -- Execute only the compaction plans whose instant timestamps appear in a user-supplied comma-separated list. This is useful for targeted maintenance or debugging, where an operator wants to run particular plans and leave the rest pending.
- Number of instants -- Execute up to N pending plans, selected in either FIFO (oldest first) or LIFO (newest first) order. This is the default strategy (with N=1), providing controlled, incremental compaction that avoids overwhelming system resources.
The ordering control (FIFO vs. LIFO) is critical for operational flexibility. FIFO ordering ensures that the oldest pending compactions are cleared first, preventing unbounded log file accumulation. LIFO ordering prioritizes the most recent compactions, which can be useful when recent data is queried more frequently and should be read-optimized first.
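The strategy pattern described above can be sketched in a few lines of Python. This is an illustrative model only, not Hudi's actual classes: the names Instant, PlanSelectStrategy, AllPending, SpecificInstants, and NumInstants are hypothetical, and the pending timeline is assumed to arrive as a list ordered oldest first.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Instant:
    """A pending compaction instant, identified by its timestamp string."""
    time: str


class PlanSelectStrategy(Protocol):
    """Hypothetical interface: pending instants (oldest first) in, instants to run out."""

    def select(self, pending: List[Instant]) -> List[Instant]: ...


class AllPending:
    """'All' strategy: run every pending plan."""

    def select(self, pending: List[Instant]) -> List[Instant]:
        return list(pending)


@dataclass
class SpecificInstants:
    """'Specific instants' strategy: run only the plans named in a comma-separated list."""
    instant_list: str

    def select(self, pending: List[Instant]) -> List[Instant]:
        wanted = {t.strip() for t in self.instant_list.split(",") if t.strip()}
        return [i for i in pending if i.time in wanted]


@dataclass
class NumInstants:
    """'Number of instants' strategy: run up to max_plans plans in FIFO or LIFO order."""
    max_plans: int = 1
    lifo: bool = False

    def select(self, pending: List[Instant]) -> List[Instant]:
        ordered = list(reversed(pending)) if self.lifo else list(pending)
        return ordered[: max(self.max_plans, 0)]


# Four pending plans, oldest first (timestamps are made up for the example).
pending = [Instant("001"), Instant("002"), Instant("003"), Instant("004")]
default_pick = NumInstants().select(pending)                      # oldest plan only
newest_two = NumInstants(max_plans=2, lifo=True).select(pending)  # "004" then "003"
```

Because every strategy exposes the same select method, the compactor loop can stay unchanged when the operator switches from one strategy to another.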
Usage
Apply this principle when:
- Configuring a standalone compactor job: Choose the strategy based on operational needs -- num_instants for steady-state compaction, all for catch-up after an outage, instants for targeted maintenance.
- Tuning compaction throughput: Adjust maxNumCompactionPlans to control how many plans are executed per cycle, balancing compaction throughput against cluster resource usage.
- Managing compaction ordering: Set the sequence to FIFO for age-based priority or LIFO for recency-based priority.
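A small consistency check can catch mismatched settings before a compactor job starts. The sketch below is a hypothetical helper, not part of Hudi: the CompactorConfig class and its field names are invented for illustration and only mirror the three choices listed above (strategy, plans per cycle, sequence).

```python
from dataclasses import dataclass
from typing import List, Optional

STRATEGIES = {"num_instants", "all", "instants"}
SEQUENCES = {"FIFO", "LIFO"}


@dataclass
class CompactorConfig:
    """Hypothetical config holder; field names are illustrative only."""
    strategy: str = "num_instants"
    max_num_compaction_plans: int = 1
    sequence: str = "FIFO"
    instants: Optional[str] = None  # comma-separated instant timestamps


def validate(cfg: CompactorConfig) -> List[str]:
    """Return human-readable problems; an empty list means the settings are consistent."""
    problems = []
    if cfg.strategy not in STRATEGIES:
        problems.append(f"unknown strategy: {cfg.strategy}")
    if cfg.sequence not in SEQUENCES:
        problems.append(f"unknown sequence: {cfg.sequence}")
    if cfg.strategy == "instants" and not (cfg.instants or "").strip():
        problems.append("strategy 'instants' needs a comma-separated instant list")
    if cfg.strategy == "num_instants" and cfg.max_num_compaction_plans < 1:
        problems.append("strategy 'num_instants' needs at least one plan per cycle")
    return problems


steady_state = CompactorConfig()                               # one plan per cycle, oldest first
catch_up = CompactorConfig(strategy="all")                     # clear the whole backlog
broken = CompactorConfig(strategy="instants")                  # list of instants is missing
```

Here validate(steady_state) and validate(catch_up) return no problems, while validate(broken) reports the missing instant list.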
Theoretical Basis
The compaction plan selection problem is an instance of work scheduling in a multi-version storage system. Each pending compaction plan represents a unit of deferred work (merging delta logs into base files), and the scheduler must choose which units to execute.
The Pending Timeline Model
The Hudi timeline maintains compaction instants in an ordered sequence:
Timeline: [C1:requested, C2:requested, C3:requested, C4:requested]
           ^oldest                                   ^newest
Each instant Cn represents a compaction plan containing a set of file group operations (merge log files into base file).
Strategy Selection Function
The selection is modeled as a function from timeline to a filtered list:
FUNCTION selectPlans(timeline, strategy, config):
    CASE strategy OF:
        "all":
            RETURN timeline.getAllInstants()
        "instants":
            target_set = PARSE(config.instantList)
            RETURN FILTER(timeline, instant -> instant.time IN target_set)
        "num_instants":
            ordered = IF config.sequence == LIFO:
                          REVERSE(timeline.getAllInstants())
                      ELSE:
                          timeline.getAllInstants()  // FIFO by default
            RETURN TAKE(ordered, MIN(config.maxPlans, LENGTH(ordered)))
Ordering and Fairness
The FIFO/LIFO choice creates different fairness properties:
- FIFO (First-In-First-Out): Bounds staleness as long as the compactor keeps up. If compaction plans are generated at rate R and consumed at rate C >= R, the backlog does not grow, so a plan that arrives with B plans queued ahead of it waits roughly B/C time units before it runs. This prevents starvation of older file groups.
- LIFO (Last-In-First-Out): Optimizes for temporal locality in read patterns. If most queries access recent data, compacting the newest plans first means the most recently written data is read-optimized first, at the cost of potentially unbounded staleness for older plans whenever new plans keep arriving.
In practice, FIFO with maxNumCompactionPlans=1 provides a safe default that processes compaction plans in strict chronological order.
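The fairness trade-off can be made concrete with a toy simulation. The model is an assumption chosen for illustration, not a measurement of Hudi: time advances in cycles, a fixed number of new plans arrives each cycle, and the compactor then executes up to max_plans plans from the chosen end of the queue.

```python
from collections import deque
from typing import Tuple


def simulate(lifo: bool, cycles: int, backlog: int,
             max_plans: int = 1, arrivals_per_cycle: int = 1) -> Tuple[int, int]:
    """Return (longest wait of any executed plan, age of the oldest plan still pending).

    Waits and ages are measured in cycles; the second value is 0 if nothing is pending.
    """
    # Each entry is the cycle in which the plan was created; oldest plan on the left.
    pending = deque(0 for _ in range(backlog))
    max_wait = 0
    for now in range(1, cycles + 1):
        for _ in range(arrivals_per_cycle):
            pending.append(now)
        for _ in range(min(max_plans, len(pending))):
            created = pending.pop() if lifo else pending.popleft()
            max_wait = max(max_wait, now - created)
    oldest_age = cycles - pending[0] if pending else 0
    return max_wait, oldest_age


# Start with 3 backlogged plans; one plan arrives and one plan runs per cycle.
fifo_short = simulate(lifo=False, cycles=10, backlog=3)
lifo_short = simulate(lifo=True, cycles=10, backlog=3)
fifo_long = simulate(lifo=False, cycles=100, backlog=3)
lifo_long = simulate(lifo=True, cycles=100, backlog=3)
```

In this model the FIFO wait stays at 3 cycles whether the run lasts 10 or 100 cycles, while under LIFO every new plan runs immediately and the three backlogged plans are never picked up, so the age of the oldest pending plan grows with the length of the run.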