
Principle:Apache Hudi Compaction Commit

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Stream_Processing
Last Updated 2026-02-08 00:00 GMT

Overview

Collecting compaction results from all parallel tasks, determining completeness, and atomically committing or rolling back the compaction instant on the Hudi timeline.

Description

After individual file group compaction operations complete across multiple parallel Flink tasks, their results must be aggregated and committed as a single atomic operation on the Hudi timeline. The Compaction Commit principle governs this final phase of the compaction workflow, addressing the challenge of coordinating a distributed set of independent compaction tasks into a single consistent commit.

The commit phase has several critical responsibilities:

  1. Event buffering and completeness detection: As compaction tasks complete, they emit CompactionCommitEvent objects containing write statuses (or null for failures). These events arrive at a single sink operator (parallelism=1), which buffers them by instant time and file ID. A compaction instant is ready for commit when the buffer contains exactly as many events as there are operations in the original compaction plan.
  2. Failure handling: If any compaction task reported a failure (null write statuses), or if the total error record count exceeds zero (with IGNORE_FAILED disabled), the entire compaction instant is rolled back. These all-or-nothing semantics ensure that partial compaction results do not corrupt the table state.
  3. Atomic commit: When all operations succeed, the handler creates HoodieCommitMetadata from the aggregated write statuses and calls writeClient.completeCompaction() to transition the compaction instant from inflight to committed on the timeline. This atomic operation makes the compacted base files visible to readers.
  4. Post-commit cleanup: After a successful compaction commit, if asynchronous cleaning is disabled (CLEAN_ASYNC_ENABLED=false), the handler triggers a synchronous writeClient.clean() call to remove old base files and log files that are no longer needed. This reclaims storage space and prevents unbounded growth of historical file versions.
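The failure-handling decision above can be sketched as follows. This is a minimal illustrative model, not Hudi's actual API: the CompactionCommitEvent dataclass, the write-status dictionaries, and the ignore_failed flag are hypothetical stand-ins for the corresponding Hudi types and configuration.

```python
# Hypothetical sketch of the all-or-nothing commit decision; the names
# CompactionCommitEvent, write_statuses, and ignore_failed are illustrative
# stand-ins, not Hudi's real types.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CompactionCommitEvent:
    instant: str
    file_id: str
    # None signals that the compaction task for this file group failed.
    write_statuses: Optional[List[dict]] = None


def should_commit(events: List[CompactionCommitEvent],
                  ignore_failed: bool = False) -> bool:
    """True if the instant may be committed, False if it must be rolled back."""
    for event in events:
        if event.write_statuses is None:
            # A task reported a failure: roll back the whole instant.
            return False
        errors = sum(s.get("total_error_records", 0) for s in event.write_statuses)
        if errors > 0 and not ignore_failed:
            # Error records present and failures are not ignored.
            return False
    return True
```

The decision is binary over the full event set for an instant: one failed task, or any error records with failures not ignored, vetoes the commit for every file group.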

Usage

Apply this principle when:

  • Understanding compaction atomicity: The commit phase is where the all-or-nothing guarantee of compaction is enforced. A compaction is either fully committed (all file groups merged and new base files visible) or fully rolled back (no changes visible to readers).
  • Debugging commit failures: If compaction tasks complete but the commit fails, check for error records in the write statuses, timeline conflicts, or storage permission issues.
  • Configuring post-compaction cleanup: The CLEAN_ASYNC_ENABLED flag controls whether cleaning runs as part of the commit phase. Disabling async cleaning means the compaction commit sink handles cleanup synchronously, which adds latency but ensures cleanup always runs.
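As an illustration, synchronous post-commit cleaning might be configured on a Flink SQL table like this. The option keys follow the Hudi Flink connector's naming convention, but treat them as assumptions and verify the exact keys and defaults against your Hudi version's documentation.

```sql
-- Illustrative Flink SQL options; key names follow the Hudi Flink
-- connector's convention and should be verified for your Hudi version.
CREATE TABLE hudi_mor_table (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_mor_table',
  'table.type' = 'MERGE_ON_READ',
  'clean.async.enabled' = 'false',   -- commit sink cleans synchronously
  'clean.retain_commits' = '10'      -- old versions kept for time travel
);
```

With 'clean.async.enabled' set to 'false', each successful compaction commit pays the cleaning latency up front in exchange for guaranteed cleanup.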

Theoretical Basis

The compaction commit implements a barrier synchronization pattern followed by an atomic state transition on the Hudi timeline.

Barrier Synchronization Model

The commit buffer acts as a barrier that waits for all parallel tasks to report:

STATE: commitBuffer = Map<InstantTime, Map<FileID, CompactionCommitEvent>>

ON RECEIVE(event):
  commitBuffer[event.instant][event.fileId] = event
  plan = GET_PLAN(event.instant)

  IF SIZE(commitBuffer[event.instant]) == SIZE(plan.operations):
    // Barrier reached: all tasks reported
    IF ANY event IN commitBuffer[event.instant] IS_FAILED:
      ROLLBACK(event.instant)
    ELSE:
      COMMIT(event.instant)
    CLEAR commitBuffer[event.instant]

This is a counting barrier -- it does not require explicit synchronization between tasks, only a count of completed events matching the expected count from the compaction plan.
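The counting barrier above can be modeled in a few lines of Python. This is a hedged sketch under the assumption that the expected event count per instant is known from the compaction plan; CommitBarrier, plan_sizes, and the decision strings are illustrative names, not Hudi's sink implementation.

```python
# Minimal counting-barrier sketch; CommitBarrier and plan_sizes are
# illustrative names, not Hudi's actual sink operator.
from collections import defaultdict


class CommitBarrier:
    def __init__(self, plan_sizes):
        # plan_sizes: instant -> number of operations in its compaction plan
        self.plan_sizes = plan_sizes
        self.buffer = defaultdict(dict)  # instant -> {file_id: failed?}
        self.decisions = {}              # instant -> "COMMIT" | "ROLLBACK"

    def on_event(self, instant, file_id, failed):
        self.buffer[instant][file_id] = failed
        if len(self.buffer[instant]) == self.plan_sizes[instant]:
            # Barrier reached: every task for this instant has reported.
            if any(self.buffer[instant].values()):
                self.decisions[instant] = "ROLLBACK"
            else:
                self.decisions[instant] = "COMMIT"
            del self.buffer[instant]  # clear buffered events for the instant
```

Note that no decision is taken until the buffered count matches the plan size, which is exactly what makes the barrier safe under arbitrary event arrival order.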

Commit Atomicity

The commit itself is atomic because it transitions a single instant on the Hudi timeline:

FUNCTION commit(instant, writeStatuses):
  metadata = CREATE_COMMIT_METADATA(writeStatuses)
  // Atomic timeline transition: inflight -> committed
  timeline.completeCompaction(instant, metadata)
  // After this point, new base files are visible to readers
  // Old log files are still present until cleaned

The key invariant is that the timeline transition is atomic -- readers see either the old state (base + logs) or the new state (new base file), never a partial state.
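The invariant can be illustrated with a toy model in which the published table state is switched by a single reference assignment: any reader observes either the old view (base plus logs) or the new view (compacted base), never a mix. TimelineModel is purely illustrative; in Hudi the same role is played by the timeline transition from inflight to committed.

```python
# Toy model of the atomicity invariant; TimelineModel is illustrative,
# not Hudi's timeline implementation.
class TimelineModel:
    def __init__(self):
        # State readers see before compaction: base file plus log files.
        self.visible_state = {"base": "base_v1", "logs": ["log_1", "log_2"]}

    def complete_compaction(self, new_base):
        # Single atomic swap of the published state: readers holding the
        # old reference keep a consistent old view; new readers see the
        # compacted base with no pending logs.
        self.visible_state = {"base": new_base, "logs": []}
```

The point of the model is that there is no intermediate assignment a reader could observe, mirroring the all-or-nothing visibility the timeline transition provides.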

Rollback Semantics

When a compaction must be rolled back:

FUNCTION rollback(instant):
  // Transition: inflight -> (removed from timeline)
  // Any new base files written by compaction tasks are orphaned
  // and will be cleaned up by subsequent cleaning operations
  timeline.rollbackCompaction(instant)

Rollback removes the compaction instant from the timeline, effectively making it as if the compaction never happened. Readers continue to merge base files with log files as before.
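The bookkeeping consequence of rollback can be sketched as below: removing the instant leaves any base files written under it unreferenced, to be reclaimed by a later cleaning pass. The function and its naming scheme (files prefixed with their instant) are hypothetical simplifications, not Hudi's file layout.

```python
# Illustrative rollback bookkeeping; the instant-prefixed file naming is a
# simplification, not Hudi's actual file layout.
def rollback(timeline, written_files, instant):
    # Remove the instant: the compaction effectively never happened.
    timeline.pop(instant, None)
    # Files written under the removed instant are now orphaned and
    # eligible for reclamation by subsequent cleaning operations.
    return [f for f in written_files if f.startswith(instant)]
```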

Post-Commit Cleaning

After a successful commit, old file versions become eligible for cleaning:

FUNCTION postCommitClean(config):
  IF NOT config.cleanAsyncEnabled:
    // Synchronous clean: remove old base files and log files
    // that are no longer referenced by any active file group version
    writeClient.clean()

The cleaning policy (KEEP_LATEST_COMMITS, KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_BY_HOURS) determines which old versions are retained for time-travel queries versus deleted for space reclamation.
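As a concrete illustration of one policy, retention in the spirit of KEEP_LATEST_FILE_VERSIONS can be sketched as keeping the N most recent versions of each file group and deleting the rest. This is a hedged simplification; versions_to_clean and its tuple representation are hypothetical, not the cleaner's actual interface.

```python
# Illustrative retention logic in the spirit of KEEP_LATEST_FILE_VERSIONS;
# versions_to_clean is a hypothetical helper, not Hudi's cleaner API.
def versions_to_clean(versions, retain):
    """versions: list of (file_id, commit_time); returns versions to delete."""
    by_file = {}
    for file_id, commit_time in versions:
        by_file.setdefault(file_id, []).append(commit_time)
    doomed = []
    for file_id, times in by_file.items():
        # Keep the `retain` newest versions of each file group,
        # mark everything older for deletion.
        for t in sorted(times, reverse=True)[retain:]:
            doomed.append((file_id, t))
    return doomed
```

Time-based and commit-based policies differ only in how the keep/delete boundary is chosen; the per-file-group grouping shown here is common to all of them.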

Related Pages

Implemented By
