
Principle:Apache Hudi Compaction Need Assessment

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Stream_Processing
Last Updated 2026-02-08 00:00 GMT

Overview

This principle governs whether an asynchronous compaction sub-pipeline should be wired into a streaming write job, based on the table type and compaction configuration.

Description

In a Merge-On-Read (MOR) data lake architecture, new data is first written as delta log files alongside existing base (columnar) files. Over time, these log files accumulate and degrade read performance because readers must merge logs with base files at query time. Compaction is the background process that consolidates delta logs into new base files, restoring read performance to near-parity with Copy-On-Write tables.

Before any compaction work can begin, the system must make a binary decision: does this particular write pipeline need compaction at all? This decision depends on two factors:

  1. Table type -- Only MOR tables produce delta log files. Copy-On-Write (COW) tables write directly to base files, so compaction is meaningless for them.
  2. Async compaction enablement -- Even for MOR tables, the operator may choose to run compaction as a separate offline job rather than inline with the streaming write pipeline. When async compaction is disabled in the configuration, the write pipeline should not wire in the compaction sub-pipeline.

The Compaction Need Assessment principle encapsulates this guard check. It is the first gate in the compaction workflow: if the assessment returns false, no compaction operators are added to the Flink DAG, saving resources and avoiding unnecessary scheduling overhead.

Usage

Apply this principle at pipeline construction time, before any compaction operators are instantiated. It is evaluated once during job setup (or whenever configuration is refreshed) and determines the overall shape of the Flink execution graph. Specifically:

  • Streaming write pipelines: Check during DAG construction whether to add a CompactionPlanOperator and CompactOperator downstream of the write operator.
  • Standalone compactor jobs: This check is implicit since the standalone compactor always targets MOR tables.
  • Configuration validation: Use this check in tests and configuration validators to provide early feedback when compaction settings conflict with the table type.
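The first usage above can be sketched as a guarded DAG-construction step. This is a minimal Python sketch, not Hudi's actual Flink code: `build_write_pipeline` is a hypothetical helper, and operators are modeled as a plain list of names (the operator names themselves match those mentioned above).

```python
def build_write_pipeline(table_type: str, async_compaction: bool) -> list[str]:
    """Return the ordered operator names for the streaming write DAG (illustrative)."""
    operators = ["StreamWriteOperator"]
    # The Compaction Need Assessment gate: only MOR tables with async
    # compaction enabled get the compaction sub-pipeline wired in.
    if table_type == "MERGE_ON_READ" and async_compaction:
        operators += ["CompactionPlanOperator", "CompactOperator"]
    return operators
```

If the gate returns false, the resulting graph contains no compaction operators at all, which is exactly the resource saving the principle describes.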

Theoretical Basis

The need for compaction arises from the fundamental trade-off in Log-Structured Merge (LSM) tree designs applied to data lake table formats.

The MOR Write Path

In a MOR table, the write path is optimized for low-latency ingestion:

WRITE(record) ->
  IF base_file_exists(file_group):
    APPEND record TO delta_log_file(file_group)
  ELSE:
    CREATE new_base_file(file_group, record)

This design yields excellent write throughput because appending to a log file is sequential I/O, but it defers the cost of producing a merged, read-optimized base file.
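The write path above can be sketched in Python with in-memory dictionaries standing in for base files and delta logs (a toy model, not Hudi's storage layer):

```python
def write(record: dict, file_group: str, base_files: dict, delta_logs: dict) -> None:
    """Route a record per the MOR write path: append to the delta log if a
    base file already exists for the file group, else create a new base file."""
    if file_group in base_files:
        # Base file exists: cheap sequential append to the delta log.
        delta_logs.setdefault(file_group, []).append(record)
    else:
        # First write to this file group: seed a new base (columnar) file.
        base_files[file_group] = [record]
```

Note that after the first write, every subsequent record lands in the delta log, so log files grow until compaction folds them back into a base file.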

The Read Amplification Problem

Without compaction, every read must:

READ(file_group) ->
  base_records = READ(base_file)
  FOR EACH log_file IN delta_log_files(file_group) ORDER BY timestamp:
    base_records = MERGE(base_records, log_file)
  RETURN base_records

As the number of log files grows, read latency increases linearly. This is read amplification -- the penalty for deferring merge work during writes.
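The merge-on-read path can be sketched as a key-based merge in Python, assuming each log file is a list of `(key, record)` upserts ordered by commit time (a simplification of real log-block semantics):

```python
def read_file_group(base_records: dict, log_files: list) -> dict:
    """Merge base records with every delta log; later log entries win."""
    merged = dict(base_records)
    for log_file in log_files:          # one merge pass per log file
        for key, record in log_file:
            merged[key] = record        # newer upsert overwrites older value
    return merged
```

The outer loop runs once per log file, which is the linear read-amplification cost described above: doubling the number of uncompacted logs doubles the merge work on every read.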

The Decision Function

The compaction need assessment is a simple predicate:

FUNCTION needsCompaction(config):
  RETURN config.tableType == MERGE_ON_READ
     AND config.asyncCompactionEnabled == TRUE

Both conditions must hold:

  • tableType == MOR: COW tables do not produce log files, so there is nothing to compact.
  • asyncCompactionEnabled: When disabled, the operator has chosen to manage compaction externally (e.g., via a standalone compactor job or manual trigger), so the streaming pipeline should not schedule it.

This is a pure configuration-driven decision with no runtime state dependencies, making it safe to evaluate at pipeline construction time.
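The full predicate can be sketched as a pure function over a flat options map. The key names `table.type` and `compaction.async.enabled` are modeled on Hudi's Flink option keys but should be treated as illustrative here, as should the defaults:

```python
def needs_compaction(options: dict) -> bool:
    """Compaction Need Assessment: pure, configuration-driven, no runtime state.

    Assumed defaults: COPY_ON_WRITE table type, async compaction enabled.
    """
    return (options.get("table.type", "COPY_ON_WRITE") == "MERGE_ON_READ"
            and options.get("compaction.async.enabled", True))
```

Because the function reads only static configuration, it can be evaluated once at pipeline construction time and its result baked into the shape of the execution graph.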

Related Pages

Implemented By
