Implementation: Apache Hudi HoodieRecordEmitter EmitRecord
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Stream_Processing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for emitting Hudi records to downstream Flink operators with split position tracking, provided by Apache Hudi.
Description
HoodieRecordEmitter is the default implementation of Flink's RecordEmitter interface for the Hudi FLIP-27 source. It bridges the gap between the split reader's output (records with position metadata) and the Flink source framework's output channel.
The emitRecord method performs two operations in sequence:
- Record collection: Calls output.collect(record.record()) to emit the actual data record (e.g., a RowData) to the downstream Flink operator chain.
- Position update: Calls split.updatePosition(record.fileOffset(), record.recordOffset()) to advance the split's tracked position. This position is included in the next checkpoint, enabling the source to resume from exactly this point on recovery.
The class is generic over T (the record type, typically RowData), making it reusable across different table formats and read modes. It receives HoodieRecordWithPosition<T> objects from the split reader, which wrap the actual record with file offset (which file within the file group) and record offset (which record within that file) metadata.
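The interplay between emission and position tracking can be sketched with simplified stand-in types. The MiniSplit and MiniRecord classes below are illustrative only, not part of the Hudi API; a plain List stands in for SourceOutput:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the emit-then-update pattern in HoodieRecordEmitter,
// using simplified stand-ins for the real Hudi/Flink types.
public class EmitSketch {
    // Stand-in for HoodieSourceSplit: tracks the last-read position.
    static class MiniSplit {
        long fileOffset;
        long recordOffset;
        void updatePosition(long fo, long ro) { fileOffset = fo; recordOffset = ro; }
    }

    // Stand-in for HoodieRecordWithPosition<T>: payload plus position metadata.
    static class MiniRecord {
        final String payload;    // stands in for RowData
        final long fileOffset;   // which file within the file group
        final long recordOffset; // which record within that file
        MiniRecord(String p, long fo, long ro) { payload = p; fileOffset = fo; recordOffset = ro; }
    }

    // Mirrors emitRecord: collect the payload first, then advance the position.
    static void emitRecord(MiniRecord record, List<String> output, MiniSplit split) {
        output.add(record.payload);                                   // output.collect(record.record())
        split.updatePosition(record.fileOffset, record.recordOffset); // split.updatePosition(...)
    }

    public static void main(String[] args) {
        MiniSplit split = new MiniSplit();
        List<String> downstream = new ArrayList<>();
        emitRecord(new MiniRecord("row-1", 0, 0), downstream, split);
        emitRecord(new MiniRecord("row-2", 0, 1), downstream, split);
        // The split now points at the last emitted record, which is exactly
        // what the next checkpoint would capture for recovery.
        System.out.println(downstream.size() + " " + split.fileOffset + " " + split.recordOffset);
    }
}
```

Because the position is updated only after the record is handed to the output, a checkpoint taken between two emissions always reflects a record that has already been delivered downstream.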
The companion class StreamReadMonitoringFunction implements result delivery for the legacy (V1) source path. It runs as a single-parallelism source function that periodically monitors the Hudi timeline for new commit instants via IncrementalInputSplits. When new splits are discovered, they are emitted to downstream StreamReadOperator instances via a SourceContext. The function manages checkpointed state for the last issued instant and offset, and supports a configurable splits-per-cycle limit to control memory usage.
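The monitoring loop described above follows a poll-emit-remember pattern. A minimal sketch of that pattern with simplified types (one split per instant is a simplification here; the real IncrementalInputSplits can yield many splits per instant, and the class names below are illustrative, not the actual Hudi API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the StreamReadMonitoringFunction pattern: poll the timeline,
// emit newly discovered splits up to a cap, checkpoint the last instant.
public class MonitorSketch {
    // Checkpointed state: the last instant whose splits were issued.
    static String lastIssuedInstant = null;

    // One monitoring cycle. splitsLimit plays the role of read.splits.limit.
    static List<String> monitorCycle(List<String> timelineInstants, int splitsLimit) {
        List<String> emitted = new ArrayList<>();
        for (String instant : timelineInstants) {
            if (lastIssuedInstant != null && instant.compareTo(lastIssuedInstant) <= 0) {
                continue; // already consumed in an earlier cycle
            }
            if (emitted.size() >= splitsLimit) {
                break; // cap memory usage; the rest waits for the next cycle
            }
            emitted.add("split@" + instant); // stands in for context.collect(split)
            lastIssuedInstant = instant;
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> timeline = List.of("001", "002", "003");
        System.out.println(monitorCycle(timeline, 2)); // first cycle, capped at 2 splits
        System.out.println(monitorCycle(timeline, 2)); // next cycle picks up the rest
    }
}
```

The cap explains why read.splits.limit bounds memory: splits beyond the limit are not buffered, only rediscovered on the next monitoring cycle from the checkpointed instant onward.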
Usage
HoodieRecordEmitter is used internally by the Hudi FLIP-27 source. It is instantiated by the source reader and invoked for each record produced by the split reader function. Users do not interact with it directly.
StreamReadMonitoringFunction is used in the legacy V1 source path when read.streaming.enabled = true and read.source.v2.enabled = false. It is configured through:
- read.streaming.check-interval -- monitoring interval in seconds
- read.splits.limit -- maximum number of splits emitted per monitoring cycle
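As an illustration, these options can be set as string keys on a Flink Configuration before building the table source. This is a sketch of the configuration step only; the surrounding source-construction code is elided:

```java
import org.apache.flink.configuration.Configuration;

// Sketch: select the legacy V1 streaming read path and tune monitoring.
// The option keys are the string keys described above.
Configuration conf = new Configuration();
conf.setString("read.streaming.enabled", "true");      // enable streaming read
conf.setString("read.source.v2.enabled", "false");     // stay on the V1 source path
conf.setString("read.streaming.check-interval", "60"); // poll the timeline every 60s
conf.setString("read.splits.limit", "1000");           // cap splits per monitoring cycle
```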
Code Reference
Source Location
- Repository: Apache Hudi
- File: hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/HoodieRecordEmitter.java
- Lines: 29-36
- Also: hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java (Lines 125-280)
Signature
public class HoodieRecordEmitter<T>
implements RecordEmitter<HoodieRecordWithPosition<T>, T, HoodieSourceSplit> {
@Override
public void emitRecord(
HoodieRecordWithPosition<T> record,
SourceOutput<T> output,
HoodieSourceSplit split) throws Exception {
output.collect(record.record());
split.updatePosition(record.fileOffset(), record.recordOffset());
}
}
Import
import org.apache.hudi.source.reader.HoodieRecordEmitter;
import org.apache.hudi.source.reader.HoodieRecordWithPosition;
import org.apache.hudi.source.split.HoodieSourceSplit;
import org.apache.flink.api.connector.source.SourceOutput;
import org.apache.flink.connector.base.source.reader.RecordEmitter;
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| record | HoodieRecordWithPosition<T> | Yes | The record to emit, wrapping the actual data record with file offset and record offset metadata. The record() method returns the payload (e.g., RowData); fileOffset() returns the index of the file within the file group; recordOffset() returns the record's position within that file. |
| output | SourceOutput<T> | Yes | The Flink source output collector. Calling collect() emits the record to downstream operators. |
| split | HoodieSourceSplit | Yes | The source split being read. Its updatePosition() method is called to track read progress for checkpointing. |
Outputs
| Name | Type | Description |
|---|---|---|
| Emitted record | T (typically RowData) | The data record collected into the Flink output channel, available to downstream operators (filters, aggregations, sinks). |
| Updated split position | void (side effect on HoodieSourceSplit) | The split's internal position is advanced to reflect the emitted record. This position is captured during the next checkpoint for fault-tolerant recovery. |
Usage Examples
// HoodieRecordEmitter is created and used by the Hudi source reader internally.
// The following illustrates the emission lifecycle:
import org.apache.hudi.source.reader.HoodieRecordEmitter;
import org.apache.hudi.source.reader.HoodieRecordWithPosition;
import org.apache.hudi.source.split.HoodieSourceSplit;
import org.apache.flink.api.connector.source.SourceOutput;
import org.apache.flink.table.data.RowData;
// Create the emitter (once per source reader)
HoodieRecordEmitter<RowData> emitter = new HoodieRecordEmitter<>();
// For each record from the split reader:
HoodieRecordWithPosition<RowData> record = ...; // from HoodieSplitReaderFunction.read()
SourceOutput<RowData> output = ...; // provided by Flink framework
HoodieSourceSplit split = ...; // the current split being read
// Emit the record and update position
emitter.emitRecord(record, output, split);
// record.record() is now available to downstream Flink operators
// split position is updated to (record.fileOffset(), record.recordOffset())
// Legacy V1 streaming path (StreamReadMonitoringFunction):
// StreamReadMonitoringFunction monitorFunc = new StreamReadMonitoringFunction(
// conf, path, rowType, maxCompactionMemoryInBytes, partitionPruner);
// execEnv.addSource(monitorFunc).setParallelism(1);
// Periodically discovers new splits and emits them via context.collect(split)