Principle:Apache Hudi Result Delivery

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Stream_Processing
Last Updated 2026-02-08 00:00 GMT

Overview

Emitting processed records from the source operator into the downstream dataflow graph while maintaining position tracking for fault recovery.

Description

The final step of the read pipeline is result delivery: taking the records produced by the split reader and emitting them into the framework's data processing pipeline. While conceptually simple (each record is forwarded to the next operator), this step carries several critical responsibilities:

  1. Record emission: Each record produced by the split reader must be collected into the framework's output channel. The emission mechanism differs between source API generations -- FLIP-27 sources use a SourceOutput collector, while legacy sources use a SourceContext.
  2. Position tracking: After emitting a record, the source must update the split's position (file offset and record offset within the file). This position is used during checkpointing to record exactly how far into each split the reader has progressed. On recovery, reading resumes from the last checkpointed position rather than re-reading the entire split.
  3. Continuous monitoring (streaming): For streaming reads, result delivery is interleaved with timeline monitoring. A monitoring component periodically checks the Hudi timeline for new commit instants, discovers the corresponding file splits, and forwards them to the reader. This creates a continuous cycle of discovery and delivery that keeps the read pipeline progressing as new data arrives.
  4. Backpressure propagation: If downstream operators cannot keep up, the emission mechanism naturally applies backpressure to the reader, which in turn slows split requests to the enumerator. This end-to-end backpressure ensures that the system does not buffer unbounded amounts of data.
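The first, second, and fourth responsibilities can be sketched together as a minimal delivery loop. This is an illustrative Python sketch, not the Flink or Hudi API; the names (`BoundedOutput`, `Split`, `deliver`) are hypothetical, and a bounded in-memory buffer stands in for the framework's output channel:

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Split:
    """Tracks how far the reader has progressed through one file split."""
    split_id: str
    file_offset: int = 0
    record_offset: int = 0

class BoundedOutput:
    """Downstream channel with a capacity cap; a full buffer signals backpressure."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()

    def try_collect(self, record):
        if len(self.buffer) >= self.capacity:
            return False          # downstream is slow: apply backpressure
        self.buffer.append(record)
        return True

def deliver(records, split, output):
    """Emit records, updating the split position after each successful emission."""
    emitted = 0
    for file_off, rec_off, record in records:
        if not output.try_collect(record):
            break                 # stop emitting; the reader retries later
        split.file_offset, split.record_offset = file_off, rec_off
        emitted += 1
    return emitted
```

Because the position is updated only after a successful emission, a checkpoint taken at any point reflects exactly the records that have reached the output, and a full buffer halts emission rather than dropping or buffering data unboundedly.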

Usage

Use this technique as the terminal stage of any source connector that bridges between file-based storage and a stream processing framework. It is relevant when:

  • Records must be emitted with position metadata for exactly-once processing
  • The source supports both bounded and unbounded reads with the same emission mechanism
  • Streaming reads require periodic timeline monitoring interleaved with record delivery
  • The legacy (V1) source path uses a split-monitor/reader two-operator design

Theoretical Basis

Result delivery implements the emit-and-checkpoint pattern. The core invariant is that the position recorded at checkpoint time must precisely reflect which records have been emitted:

// FLIP-27 record emission (per record)
function emitRecord(recordWithPosition, output, split):
    output.collect(recordWithPosition.record)
    split.updatePosition(recordWithPosition.fileOffset, recordWithPosition.recordOffset)

// On checkpoint:
//   split.currentPosition is saved
// On recovery:
//   reader resumes from split.currentPosition
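The emit-and-checkpoint invariant above can be exercised with a small, self-contained sketch. `SplitReader` and `Position` are hypothetical names, not the FLIP-27 interfaces, but the recovery behavior follows the same pattern: checkpoint the position after emission, and on recovery skip everything at or before it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Position:
    file_offset: int
    record_offset: int

class SplitReader:
    """Illustrative sketch of the emit-and-checkpoint invariant (not the Flink API)."""
    def __init__(self, records, start=Position(0, 0)):
        self.records = records            # list of (Position, payload) pairs
        self.position = start
        self.emitted = []

    def emit_all(self, fail_after=None):
        count = 0
        for pos, payload in self.records:
            if pos.record_offset <= self.position.record_offset:
                continue                  # already emitted before this position
            self.emitted.append(payload)
            self.position = pos           # update position only after emission
            count += 1
            if fail_after is not None and count == fail_after:
                raise RuntimeError("simulated failure")

    def checkpoint(self):
        return self.position              # reflects exactly what has been emitted
```

On recovery, constructing a reader with `start` set to the checkpointed position resumes mid-split: records emitted after the checkpoint but before the failure are re-emitted (duplicates, not gaps), which is the at-least-once behavior described below.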

For streaming reads with the legacy source, an additional monitoring loop operates:

// Legacy streaming monitoring (single-threaded coordinator)
function monitorAndForward(context):
    while running:
        synchronized(checkpointLock):
            // Check timeline for new commits since last issued instant
            result = discoverNewSplits(metaClient, lastOffset)

            if result is not empty:
                for each split in result.splits (up to splitsLimit):
                    context.collect(split)
                updateIssuedInstant(result.endInstant)
                updateIssuedOffset(result.offset)

        sleep(checkInterval)
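One cycle of this monitoring loop can be sketched as follows, modeling the timeline as a list of (instant, splits) pairs. The `splitsLimit` cap and the checkpoint lock are omitted for brevity, and all names are illustrative rather than drawn from the Hudi codebase:

```python
def monitor_once(timeline, state, collect):
    """One cycle of the legacy monitor: find commits newer than the issued
    instant, forward their splits downstream, then advance the issued instant."""
    new = [(instant, splits) for instant, splits in timeline
           if instant > state["issued_instant"]]
    if not new:
        return 0                           # nothing new on the timeline this cycle
    count = 0
    for instant, splits in new:
        for split in splits:
            collect(split)                 # forward split to the reader operator
            count += 1
    state["issued_instant"] = new[-1][0]   # next cycle starts after this instant
    return count
```

Calling `monitor_once` repeatedly (with a sleep between calls) reproduces the discover-forward-advance cycle: each pass forwards only splits from commits newer than the last issued instant, so no split is forwarded twice.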

The theoretical properties of result delivery are:

  • At-least-once guarantee: Combined with checkpointing, the position tracking ensures that on failure, records are re-emitted from the last checkpoint rather than lost. Duplicate records may occur if a failure happens between emission and checkpoint completion, but no records are skipped.
  • Exactly-once (with idempotent sinks): When the downstream sink is idempotent (e.g., upsert to a key-value store), the at-least-once guarantee from the source effectively becomes exactly-once end-to-end.
  • Ordering within a split: Records from a single split are emitted in file order (base file records first, then log file records applied in order). Cross-split ordering is not guaranteed and depends on the parallelism and scheduling of reader tasks.
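The exactly-once-with-idempotent-sinks property can be demonstrated with a trivial keyed upsert sink (an illustrative sketch, not a real connector): replaying records after a recovery leaves the sink state unchanged, so at-least-once delivery from the source yields exactly-once effects.

```python
class IdempotentSink:
    """Keyed upsert sink: re-writing the same (key, value) record is harmless."""
    def __init__(self):
        self.state = {}

    def write(self, key, value):
        self.state[key] = value   # upsert: the last write for a key wins

sink = IdempotentSink()
batch = [("k1", 1), ("k2", 2), ("k3", 3)]
for k, v in batch:
    sink.write(k, v)
# A recovery re-emits the last two records (at-least-once duplicates);
# the upsert absorbs them without changing the final state.
for k, v in batch[1:]:
    sink.write(k, v)
```

This is why upsert-style targets pair well with this source: the deduplication happens implicitly at the sink, with no transactional coordination between source checkpoints and sink commits.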

Related Pages

Implemented By
