Principle: Apache Hudi Write Result Verification
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Stream_Processing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Write Result Verification is the principle of validating write operation outcomes by inspecting write statuses for errors and confirming that commits have been successfully recorded on the Hudi timeline.
Description
After each batch of records is written to the storage layer, the streaming write pipeline produces a list of WriteStatus objects that describe the outcome of each file write. Write result verification inspects these statuses to detect failures early, before the commit is attempted.
This principle encompasses two complementary checks:
- Write status validation: Scanning the WriteStatus list for any records that failed to write. When the WRITE_FAIL_FAST option is enabled, the system immediately throws an exception if any write errors are detected, preventing a checkpoint from succeeding with partial data loss. This early detection is critical because a failed write discovered only at commit time (after the checkpoint barrier) could result in silent data loss.
- Commit verification: Checking the Hudi timeline to confirm that successful commits exist. This is used during startup (to determine whether the table has any data) and during recovery (to decide whether to recommit an inflight instant).
The fail-fast approach is particularly important in streaming scenarios:
- Without it, a write error in one task might go unnoticed until the coordinator tries to commit, by which point the checkpoint has already been triggered and the data buffer has been cleared.
- With fail-fast enabled, the task throws immediately upon detecting errors, causing the checkpoint to fail and the job to restart from the last successful checkpoint, preserving exactly-once semantics.
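The fail-fast check described above can be sketched in Java. Note that WriteStatus here is a simplified stand-in modeling only the error-tracking surface (hasErrors()/getErrors()), and the exception type and option plumbing are placeholders rather than Hudi's actual API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of fail-fast write status validation. WriteStatus is a simplified
// stand-in for Hudi's class; only the error-tracking surface is modeled.
public class FailFastValidator {

    /** Simplified WriteStatus: maps failed record keys to their causes. */
    public static class WriteStatus {
        private final Map<String, Throwable> errors = new LinkedHashMap<>();

        public void markFailure(String recordKey, Throwable cause) {
            errors.put(recordKey, cause);
        }

        public boolean hasErrors() {
            return !errors.isEmpty();
        }

        public Map<String, Throwable> getErrors() {
            return errors;
        }
    }

    /**
     * When failFast is set, throws on the first error found so the enclosing
     * checkpoint fails before any commit is attempted; otherwise errors are
     * left for the coordinator to surface at commit time.
     */
    public static void validateWriteStatus(boolean failFast, String instant,
                                           List<WriteStatus> statuses) {
        if (!failFast) {
            return;
        }
        for (WriteStatus status : statuses) {
            if (status.hasErrors()) {
                Map.Entry<String, Throwable> first =
                    status.getErrors().entrySet().iterator().next();
                throw new RuntimeException("Write failure for key " + first.getKey()
                    + " at instant " + instant, first.getValue());
            }
        }
    }
}
```

With fail-fast on, a single erroring status aborts validation immediately; with it off, the same statuses pass through untouched, matching the two branches described above.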
Usage
Use this principle in all Hudi streaming write pipelines. The write status validation is called after each bucket flush (in the write function), and the commit verification is used by the coordinator during startup and recovery. Key scenarios include:
- Fail-fast error detection: Enable WRITE_FAIL_FAST to immediately surface write errors rather than allowing them to propagate to commit time.
- Table existence checks: Use haveSuccessfulCommits() to determine whether a table has any committed data, useful for conditional processing logic.
- Recovery validation: After a job restart, the coordinator checks whether inflight instants need to be recommitted or rolled back.
Theoretical Basis
The verification follows a fail-fast validation pattern:
FUNCTION validateWriteStatus(config, currentInstant, writeStatusList):
    IF config.get(WRITE_FAIL_FAST) == true:
        FOR EACH writeStatus IN writeStatusList:
            IF writeStatus.hasErrors():
                errorEntry = writeStatus.getErrors().firstEntry()
                THROW HoodieException(
                    "Write failure for key " + errorEntry.key
                    + " at instant " + currentInstant,
                    cause = errorEntry.value)
    // If WRITE_FAIL_FAST is false, errors are logged but not thrown;
    // they will surface at commit time via the coordinator.
FUNCTION haveSuccessfulCommits(metaClient):
    timeline = metaClient.getCommitsTimeline().filterCompletedInstants()
    RETURN !timeline.empty()
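The commit-verification check can be rendered concretely in Java. The Instant record and State enum below are illustrative stand-ins for Hudi's timeline types, not the real metaClient API:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of commit verification against a timeline. Instant and State are
// simplified stand-ins for Hudi's timeline types.
public class CommitCheck {

    public enum State { REQUESTED, INFLIGHT, COMPLETED }

    /** Simplified timeline instant: a timestamp plus its lifecycle state. */
    public record Instant(String timestamp, State state) {}

    /** Mirrors timeline.filterCompletedInstants(): keep only completed commits. */
    public static List<Instant> filterCompletedInstants(List<Instant> timeline) {
        return timeline.stream()
            .filter(i -> i.state() == State.COMPLETED)
            .collect(Collectors.toList());
    }

    /** True iff the table has at least one successfully committed instant. */
    public static boolean haveSuccessfulCommits(List<Instant> timeline) {
        return !filterCompletedInstants(timeline).isEmpty();
    }
}
```

An inflight-only timeline (e.g. right after a crash mid-commit) reports no successful commits, which is exactly the signal the coordinator uses during startup and recovery.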
The error propagation model works as follows:
Write Task                              Coordinator
    |                                       |
    |-- flush bucket                        |
    |     [WriteStatus, possibly            |
    |      with errors]                     |
    |                                       |
    |-- validateWriteStatus()               |
    |     IF fail-fast:                     |
    |       THROW -> checkpoint fails       |
    |     ELSE:                             |
    |       continue                        |
    |                                       |
    |-- send WriteMetadataEvent ----------->|
    |                                       |-- notifyCheckpointComplete()
    |                                       |-- commit(instant, statuses)
    |                                       |     [errors logged at commit time]
The fail-fast path ensures that the Flink checkpoint fails, triggering job restart from the last good checkpoint. The non-fail-fast path defers error handling to the commit phase, where errors are logged but the commit may still succeed (recording partial results).
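The deferred (non-fail-fast) branch amounts to an aggregation step: errors from every status are merged so the commit phase can log or inspect them in one place. The collectErrors helper and Status record below are hypothetical illustrations, not part of Hudi's API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the deferred error path: merge every status's errors into one
// map that the coordinator can log (or inspect) at commit time.
// collectErrors is a hypothetical helper, not a Hudi API.
public class DeferredErrors {

    /** Minimal stand-in for a write status carrying per-key failure messages. */
    public record Status(Map<String, String> errors) {}

    /** Merge all per-status errors, keyed by record key. */
    public static Map<String, String> collectErrors(List<Status> statuses) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Status s : statuses) {
            merged.putAll(s.errors());
        }
        return merged;
    }
}
```

An empty merged map at commit time corresponds to a clean commit; a non-empty one is the "errors logged at commit time" case in the diagram above, where the commit may still record partial results.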
The haveSuccessfulCommits() check queries the Hudi active timeline for any completed commit instants, providing a simple boolean indicator of table health.