Workflow: Apache Hudi Flink Batch Incremental Read
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Stream_Processing, Data_Lake |
| Last Updated | 2026-02-08 20:00 GMT |
Overview
End-to-end process for reading Hudi tables in Apache Flink, supporting batch snapshot queries, incremental queries, and continuous streaming reads via the FLIP-27 Source API.
Description
This workflow covers reading data from Hudi tables using Flink's SQL or DataStream API. It supports three query modes: batch snapshot reads that return the latest committed state, incremental reads that return only records changed since a specified point in time, and continuous streaming reads that monitor the timeline for new commits and emit changes in real time. The read path leverages vectorized Parquet reading for performance, expression-based predicate pushdown for partition and file pruning, and the FLIP-27 Source framework for split discovery, enumeration, and parallel reading.
Usage
Execute this workflow when you need to consume data from an existing Hudi table in Flink. Use batch mode for point-in-time analytics or ETL jobs. Use incremental mode for efficient change data capture between pipeline runs. Use streaming mode for continuous real-time processing of new data arriving in the Hudi table.
Execution Steps
Step 1: Configure Flink Read Environment
Set up the Flink execution environment for the desired query mode. For batch reads, configure the execution mode as BATCH. For streaming reads, configure as STREAMING with appropriate checkpoint intervals. Register the Hudi catalog or configure the connector properties for table discovery.
Key considerations:
- Batch mode processes all data once and completes
- Streaming mode runs continuously, monitoring for new commits
- The HoodieTableSource resolves capabilities based on the configured query type
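As a minimal sketch, a batch read can be set up from the Flink SQL client. The table name, schema, and path below are illustrative; the connector option keys follow the Hudi Flink connector's documented naming, but verify them against the Hudi version in use.

```sql
-- Illustrative SQL-client session: batch snapshot read of a Hudi table.
SET 'execution.runtime-mode' = 'batch';

CREATE TABLE hudi_orders (
  order_id STRING,
  amount   DOUBLE,
  ts       TIMESTAMP(3),
  `dt`     STRING
) PARTITIONED BY (`dt`) WITH (
  'connector'  = 'hudi',
  'path'       = 'hdfs:///warehouse/hudi_orders',
  'table.type' = 'MERGE_ON_READ'
);

-- Returns the latest committed state, then the job completes.
SELECT * FROM hudi_orders;
```

For a streaming read, switch the runtime mode to 'streaming' and enable the streaming read options shown in the later steps.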
Step 2: Define Query Type and Parameters
Select the query type: snapshot, incremental, or read-optimized. For incremental queries, specify the start and optional end commit timestamps. For streaming reads, configure the monitoring interval and start commit. For read-optimized queries on MOR tables, only the base (Parquet) files are read.
Key considerations:
- Snapshot queries always read the latest committed state
- Incremental queries use IncrementalInputSplits to identify changed file groups
- Read-optimized queries skip log files for faster reads at the cost of freshness
- Time-travel queries can read the table state as of a specific commit instant
Step 3: Split Discovery and Partition Pruning
The FileIndex resolves which file groups to read based on configured partition predicates. Expression-based partition pruning eliminates irrelevant partitions before split generation. The DefaultHoodieSplitProvider converts relevant file groups into HoodieSourceSplits for parallel reading.
Key considerations:
- Partition pruning significantly reduces I/O for partitioned tables
- Column statistics index enables file-level pruning within partitions
- Record-level index can accelerate point lookups by specific record keys
- The ColumnStatsProbe evaluates min/max statistics to skip irrelevant files
Step 4: Split Enumeration and Assignment
The HoodieSplitEnumerator distributes discovered splits across reader subtasks. For batch mode, the HoodieStaticSplitEnumerator assigns all splits upfront. For streaming mode, the HoodieContinuousSplitEnumerator periodically discovers new splits as commits arrive. The split assigner balances work across available reader tasks.
Key considerations:
- Static enumeration assigns all splits at job start for batch processing
- Continuous enumeration polls the Hudi timeline at configured intervals
- Split assignment strategies include ordered, bucket-aware, and number-based balancing
- The enumerator state is checkpointed for exactly-once guarantees
Step 5: Read and Process Data
Reader tasks consume their assigned splits using the HoodieSourceSplitReader. For COW tables, this reads Parquet files directly using vectorized column readers (ParquetColumnarRowSplitReader). For MOR tables, base file records are merged with log file records using the HoodieFileGroupReader. The HoodieRecordEmitter converts internal records to Flink RowData.
Key considerations:
- Vectorized Parquet reading provides high-throughput columnar access
- MOR merge applies log records on top of base file snapshot
- Schema projection pushes down selected columns to the reader level
- Each Flink version has its own Parquet reader implementation due to evolving vector APIs
Step 6: Deliver Results
Processed records are emitted downstream in the Flink pipeline for further transformation, aggregation, or output. For streaming reads, the StreamReadOperator manages the lifecycle of split processing and ensures ordered delivery when configured. Monitor read metrics to verify throughput and latency.
Key considerations:
- Records maintain Hudi metadata fields when configured (commit time, record key)
- Streaming reads deliver changes in commit-time order by default
- Backpressure from downstream operators is handled by the split reader framework
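A typical end-to-end delivery pattern streams changes from the Hudi table into a downstream sink, carrying the Hudi metadata fields along. The sink table `order_audit` is hypothetical, and selecting `_hoodie_commit_time` / `_hoodie_record_key` assumes those metadata columns are declared in the source table's Flink schema.

```sql
-- Continuous pipeline: emit each change with its commit time and record key.
INSERT INTO order_audit
SELECT
  `_hoodie_commit_time`,  -- commit that produced the record
  `_hoodie_record_key`,   -- Hudi record key
  order_id,
  amount
FROM hudi_orders /*+ OPTIONS(
  'read.streaming.enabled' = 'true',
  'read.start-commit'      = 'earliest'
) */;
```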