Principle:Apache Hudi Compaction Plan Generation


Knowledge Sources
Domains Data_Lake, Stream_Processing
Last Updated 2026-02-08 00:00 GMT

Overview

Selecting which pending compaction plans to execute from a set of candidates on the timeline using configurable filtering strategies.

Description

Once compaction has been scheduled on a Hudi MOR table, compaction plan instants accumulate on the active timeline in a requested state. A running compactor -- whether inline, async, or standalone -- must decide which of these pending plans to pick up for execution during any given compaction cycle. This decision is not trivial in production environments where multiple compaction plans may be pending simultaneously due to processing delays, backpressure, or operator restarts.

The Compaction Plan Generation principle addresses this selection problem through a strategy pattern. Rather than hardcoding a single selection approach, the system defines a pluggable interface that accepts the pending compaction timeline and returns an ordered subset of instants to compact. Three built-in strategies cover the most common operational needs:

  1. All (all) -- Execute every pending compaction plan. This is the most aggressive approach, suitable for catch-up scenarios or batch environments where all outstanding work should be cleared.
  2. Specific instants (instants) -- Execute only the compaction plans whose instant timestamps appear in a user-supplied comma-separated list. This is useful for targeted maintenance or debugging, where an operator wants to run particular plans rather than whatever is next in line.
  3. Number of instants (num_instants) -- Execute up to N pending plans, selected in either FIFO (oldest first) or LIFO (newest first) order. This is the default strategy (with N=1), providing controlled, incremental compaction that avoids overwhelming system resources.

The ordering control (FIFO vs. LIFO) is critical for operational flexibility. FIFO ordering ensures that the oldest pending compactions are cleared first, preventing unbounded log file accumulation. LIFO ordering prioritizes the most recent compactions, which can be useful when recent data is queried more frequently and should be read-optimized first.

Usage

Apply this principle when:

  • Configuring a standalone compactor job: Choose the strategy based on operational needs -- num_instants for steady-state compaction, all for catch-up after an outage, instants for targeted maintenance.
  • Tuning compaction throughput: Adjust maxNumCompactionPlans to control how many plans are executed per cycle, balancing compaction throughput against cluster resource usage.
  • Managing compaction ordering: Set the sequence to FIFO for age-based priority or LIFO for recency-based priority.
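
As a rough illustration of the three scenarios above, the sketch below bundles the settings into a small Python dataclass and validates them. The class `CompactorSettings` and its field names are hypothetical, chosen to echo the names used on this page; they are not Hudi's configuration classes, and the validation rules are assumptions about what a sensible combination looks like.

```python
from dataclasses import dataclass


@dataclass
class CompactorSettings:
    """Hypothetical settings holder; names are illustrative, not Hudi's API."""
    strategy: str = "num_instants"     # "num_instants", "all", or "instants"
    max_num_compaction_plans: int = 1  # only meaningful for "num_instants"
    sequence: str = "FIFO"             # "FIFO" or "LIFO", for "num_instants"
    instants: str = ""                 # comma-separated list, for "instants"

    def validate(self):
        if self.strategy not in ("num_instants", "all", "instants"):
            raise ValueError(f"unknown strategy: {self.strategy!r}")
        if self.strategy == "num_instants":
            if self.max_num_compaction_plans < 1:
                raise ValueError("max_num_compaction_plans must be at least 1")
            if self.sequence not in ("FIFO", "LIFO"):
                raise ValueError(f"unknown sequence: {self.sequence!r}")
        if self.strategy == "instants" and not self.instants.strip():
            raise ValueError("the 'instants' strategy needs a non-empty instant list")
        return self


# Steady state: one plan per cycle, oldest first (the defaults).
steady_state = CompactorSettings().validate()
# Catch-up after an outage: clear everything that is pending.
catch_up = CompactorSettings(strategy="all").validate()
# Targeted maintenance: made-up instant timestamps for illustration.
targeted = CompactorSettings(
    strategy="instants", instants="20240101120100,20240101120300"
).validate()
```

Rejecting an empty instant list up front is a design choice in this sketch: it surfaces a misconfigured targeted run before any plan is selected.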

Theoretical Basis

The compaction plan selection problem is an instance of work scheduling in a multi-version storage system. Each pending compaction plan represents a unit of deferred work (merging delta logs into base files), and the scheduler must choose which units to execute.

The Pending Timeline Model

The Hudi timeline maintains compaction instants in an ordered sequence:

Timeline: [C1:requested, C2:requested, C3:requested, C4:requested]
                ^oldest                                    ^newest

Each instant Cn represents a compaction plan containing a set of file group operations (merge log files into base file).
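
The timeline model can be pictured with a toy Python structure. This is a simplified sketch, not Hudi's timeline classes: the `Instant` dataclass, its field names, the state strings, and the timestamps are all assumptions made for illustration. It relies only on the fact that fixed-width instant timestamps sort chronologically as strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instant:
    """Toy stand-in for a timeline instant (hypothetical, not Hudi's class)."""
    time: str    # fixed-width timestamp, so string order is chronological
    action: str  # e.g. "compaction" or "deltacommit"
    state: str   # e.g. "requested", "inflight", "completed"


def pending_compactions(timeline):
    """Return compaction instants in the requested state, oldest first."""
    pending = [
        i for i in timeline
        if i.action == "compaction" and i.state == "requested"
    ]
    return sorted(pending, key=lambda i: i.time)


timeline = [
    Instant("20240101120300", "compaction", "requested"),
    Instant("20240101120000", "deltacommit", "completed"),
    Instant("20240101120100", "compaction", "requested"),
    Instant("20240101120200", "compaction", "inflight"),
]

pending = pending_compactions(timeline)
print([i.time for i in pending])  # ['20240101120100', '20240101120300']
```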

Strategy Selection Function

The selection is modeled as a function from timeline to a filtered list:

FUNCTION selectPlans(timeline, strategy, config):
  CASE strategy OF:
    "all":
      RETURN timeline.getAllInstants()

    "instants":
      target_set = PARSE(config.instantList)
      RETURN FILTER(timeline, instant -> instant.time IN target_set)

    "num_instants":
      ordered = IF config.sequence == LIFO:
                  REVERSE(timeline.getAllInstants())
                ELSE:
                  timeline.getAllInstants()  // FIFO by default
      RETURN TAKE(ordered, MIN(config.maxPlans, LENGTH(ordered)))
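
The pseudocode above translates fairly directly into Python. The following is a minimal sketch under stated assumptions: the function name `select_plans`, its parameter names, and the use of plain strings for instants are illustrative choices, and pending instants are assumed to arrive sorted oldest first.

```python
def select_plans(pending, strategy="num_instants", instants="",
                 max_plans=1, sequence="FIFO"):
    """Pick pending compaction instants to execute.

    pending: instant identifiers sorted oldest first (assumed input order).
    """
    if strategy == "all":
        return list(pending)

    if strategy == "instants":
        # Parse the comma-separated list, ignoring blanks and whitespace.
        targets = {t.strip() for t in instants.split(",") if t.strip()}
        return [t for t in pending if t in targets]

    if strategy == "num_instants":
        if sequence.upper() == "LIFO":
            ordered = list(reversed(pending))
        else:
            ordered = list(pending)  # FIFO by default
        # Slicing already caps the result at len(ordered).
        return ordered[:max(max_plans, 0)]

    raise ValueError(f"unknown strategy: {strategy!r}")


pending = ["C1", "C2", "C3", "C4"]
print(select_plans(pending))                               # ['C1']
print(select_plans(pending, max_plans=2, sequence="LIFO")) # ['C4', 'C3']
print(select_plans(pending, strategy="instants",
                   instants="C3, C1, C9"))                 # ['C1', 'C3']
print(select_plans(pending, strategy="all"))               # ['C1', 'C2', 'C3', 'C4']
```

In this sketch an instant named in the list but absent from the timeline (C9 here) is silently skipped; a stricter variant could raise an error instead.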

Ordering and Fairness

The FIFO/LIFO choice creates different fairness properties:

  • FIFO (First-In-First-Out): Bounds staleness. If compaction plans are generated at rate R and consumed at rate C >= R, the backlog does not grow, so a plan scheduled behind B pending plans waits roughly B/C time units before it runs. This prevents starvation of older file groups.
  • LIFO (Last-In-First-Out): Optimizes for temporal locality in read patterns. If most queries access recent data, compacting the newest plans first gets the hot portion of the data read-optimized sooner, at the cost of potentially unbounded staleness for older plans.

In practice, FIFO with maxNumCompactionPlans=1 provides a safe default that processes compaction plans in strict chronological order.
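
The difference between the two orderings can be seen in a small simulation. This is a toy model, not a measurement of Hudi: it assumes one new plan arrives per cycle, the compactor runs up to `per_cycle` plans per cycle, and a backlog of plans already exists at the start. The function name `simulate` and all parameters are hypothetical.

```python
from collections import deque


def simulate(sequence, cycles=20, backlog=3, per_cycle=1):
    """Return (max wait of executed plans, age of oldest plan still pending).

    Plans are represented by the cycle number in which they were created.
    """
    pending = deque([0] * backlog)  # initial backlog, created at cycle 0
    max_wait = 0
    for cycle in range(1, cycles + 1):
        pending.append(cycle)  # one new plan is scheduled each cycle
        for _ in range(min(per_cycle, len(pending))):
            # FIFO takes the oldest plan, LIFO takes the newest.
            created = pending.pop() if sequence == "LIFO" else pending.popleft()
            max_wait = max(max_wait, cycle - created)
    oldest_pending_age = cycles - pending[0] if pending else 0
    return max_wait, oldest_pending_age


fifo = simulate("FIFO")
lifo = simulate("LIFO")
print("FIFO:", fifo)  # FIFO: (3, 2)
print("LIFO:", lifo)  # LIFO: (0, 20)
```

Under these assumptions, FIFO keeps every wait at the backlog size of three cycles, while LIFO executes each new plan immediately and leaves the three original plans pending for the whole run.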

Related Pages

Implemented By
