Heuristic: Apache Hudi Compaction Scheduling Safety
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Data_Integrity |
| Last Updated | 2026-02-08 20:00 GMT |
Overview
Compaction plans should be scheduled by the Hudi writer job, not by standalone compaction jobs, to avoid the risk of data loss.
Description
Apache Hudi MOR (Merge-on-Read) tables require periodic compaction to merge delta log files into base Parquet files. The compaction plan (which log files to compact) can be scheduled either inline within the streaming writer job or externally by a standalone compaction job. The Hudi codebase explicitly warns that scheduling compaction outside the writer job carries a risk of data loss, because the external scheduler may not have a consistent view of in-flight writes. The recommended pattern is to let the writer job schedule compaction plans and use the standalone compaction job only for execution of those plans.
Usage
Apply this heuristic when configuring MOR table compaction in Flink. If you are running a standalone `HoodieFlinkCompactor`, keep `--schedule false` (the default) and rely on the writer job to schedule compaction plans. Only enable `--schedule true` if you understand and accept the data loss risk.
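As a concrete sketch of the recommended split (writer schedules, offline job executes), the writer-side table options might look like the following. The option keys `compaction.schedule.enabled`, `compaction.async.enabled`, and `compaction.delta_commits` are taken from hudi-flink's `FlinkOptions`; verify them against your Hudi version before use.

```java
import java.util.HashMap;
import java.util.Map;

public class MorCompactionOptions {
    // Recommended writer-side options for a Flink Hudi MOR sink:
    // the writer job generates compaction plans, while plan execution is
    // left to a standalone HoodieFlinkCompactor run with the default
    // --schedule false.
    static Map<String, String> writerSideOptions() {
        Map<String, String> opts = new HashMap<>();
        opts.put("table.type", "MERGE_ON_READ");
        opts.put("compaction.schedule.enabled", "true"); // writer schedules plans (default)
        opts.put("compaction.async.enabled", "false");   // don't execute inline; offline job runs the plan
        opts.put("compaction.delta_commits", "5");       // plan every 5 delta commits (default)
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(writerSideOptions());
    }
}
```

Passing these options to the `CREATE TABLE ... WITH (...)` clause (or the table options map) keeps scheduling co-located with the writer, per this heuristic.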
The Insight (Rule of Thumb)
- Action: Keep `--schedule false` (the default) on standalone compaction jobs. Let the writer job handle scheduling.
- Value: The writer job uses `compaction.delta_commits=5` (default) to schedule compaction after every 5 delta commits.
- Trade-off: Inline scheduling adds slight overhead to the writer job but guarantees consistency with in-flight writes.
- Exception: If you must schedule externally, also set `--job-max-processing-time-ms` for the retry mechanism to function (otherwise `--retry-last-failed-job` is silently ineffective).
Reasoning
The writer job has exclusive knowledge of which commits are in-flight and which log files are being actively written. A standalone compaction scheduler lacks this view and may schedule compaction of files that are still being written, leading to data corruption or loss. By co-locating scheduling with the writer, the compaction plan is generated atomically with the write commit, ensuring consistency.
Additionally, both the compaction and clustering standalone jobs have a --retry-last-failed-job flag that silently does nothing unless --job-max-processing-time-ms is set to a positive value, creating a configuration trap where retries appear enabled but are inactive.
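The shape of that trap can be sketched as a predicate. This is an illustrative stand-in for the parsed CLI flags, not the actual Hudi validation code:

```java
public class RetryConfigCheck {
    // Mirrors the behavior warned about in HoodieFlinkCompactor:
    // --retry-last-failed-job only takes effect when
    // --job-max-processing-time-ms is set to a positive value.
    // Parameter names are illustrative stand-ins for the parsed flags.
    static boolean retryEffective(boolean retryLastFailedJob, long jobMaxProcessingTimeMs) {
        return retryLastFailedJob && jobMaxProcessingTimeMs > 0;
    }

    public static void main(String[] args) {
        // The trap: retry flag enabled, timeout unset -> retry silently inactive.
        System.out.println(retryEffective(true, 0));     // false
        System.out.println(retryEffective(true, 60000)); // true
    }
}
```

In other words, treat the two flags as a pair: enabling retry without the processing-time bound leaves retries inactive with only a warning in the logs.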
Code Evidence
Compaction scheduling warning from FlinkCompactionConfig.java:122-126:
```java
@Parameter(names = {"--schedule", "-sc"}, description = "Not recommended. Schedule the compaction plan in this job.\n"
    + "There is a risk of losing data when scheduling compaction outside the writer job.\n"
    + "Scheduling compaction in the writer job and only let this job do the compaction execution is recommended.\n"
    + "Default is false")
public Boolean schedule = false;
```
Retry mechanism warning from HoodieFlinkCompactor.java:80-82:
```java
LOG.warn("--retry-last-failed-job is enabled but --job-max-processing-time-ms is not set or <= 0. "
    + "The retry-last-failed feature will have no effect.");
```
Compaction trigger defaults from FlinkOptions.java:937-941:
```java
public static final ConfigOption<Integer> COMPACTION_DELTA_COMMITS = ConfigOptions
    .key("compaction.delta_commits")
    .intType()
    .defaultValue(5)
    .withDescription("Max delta commits needed to trigger compaction, default 5 commits");
```