Workflow: Apache Hudi Flink Batch Incremental Read
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Stream_Processing, Data_Lake |
| Last Updated | 2026-02-08 20:00 GMT |
Overview
End-to-end process for reading Hudi tables in Apache Flink, supporting batch snapshot queries, incremental queries, and continuous streaming reads via the FLIP-27 Source API.
Description
This workflow covers reading data from Hudi tables using Flink's SQL or DataStream API. It supports three query modes: batch snapshot reads that return the latest committed state, incremental reads that return only records changed since a specified point in time, and continuous streaming reads that monitor the timeline for new commits and emit changes in real time. The read path leverages vectorized Parquet reading for performance, expression-based predicate pushdown for partition and file pruning, and the FLIP-27 Source framework for split discovery, enumeration, and parallel reading.
Usage
Execute this workflow when you need to consume data from an existing Hudi table in Flink. Use batch mode for point-in-time analytics or ETL jobs. Use incremental mode for efficient change data capture between pipeline runs. Use streaming mode for continuous real-time processing of new data arriving in the Hudi table.
Execution Steps
Step 1: Configure Flink Read Environment
Set up the Flink execution environment for the desired query mode. For batch reads, configure the execution mode as BATCH. For streaming reads, configure as STREAMING with appropriate checkpoint intervals. Register the Hudi catalog or configure the connector properties for table discovery.
Key considerations:
- Batch mode processes all data once and completes
- Streaming mode runs continuously, monitoring for new commits
- The HoodieTableSource resolves capabilities based on the configured query type
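As a minimal sketch, a batch read can be set up from the Flink SQL client. The table name, schema, and path below are illustrative; the connector option keys follow the Hudi Flink connector's documented naming, but verify them against the Hudi version in use.

```sql
-- Illustrative SQL-client session: batch snapshot read of a Hudi table.
SET 'execution.runtime-mode' = 'batch';

CREATE TABLE hudi_orders (
  order_id STRING,
  amount   DOUBLE,
  ts       TIMESTAMP(3),
  `dt`     STRING
) PARTITIONED BY (`dt`) WITH (
  'connector'  = 'hudi',
  'path'       = 'hdfs:///warehouse/hudi_orders',
  'table.type' = 'MERGE_ON_READ'
);

-- Returns the latest committed state, then the job completes.
SELECT * FROM hudi_orders;
```

For a streaming read, switch the runtime mode to 'streaming' and enable the streaming read options shown in the later steps.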
Step 2: Define Query Type and Parameters
Select the query type: snapshot, incremental, or read-optimized. For incremental queries, specify the start and optional end commit timestamps. For streaming reads, configure the monitoring interval and start commit. For read-optimized queries on MOR tables, only the base (Parquet) files are read.
Key considerations:
- Snapshot queries always read the latest committed state
- Incremental queries use IncrementalInputSplits to identify changed file groups
- Read-optimized queries skip log files for faster reads at the cost of freshness
- Time-travel queries can read the table state as of a specific commit instant
Step 3: Split Discovery and Partition Pruning
The FileIndex resolves which file groups to read based on configured partition predicates. Expression-based partition pruning eliminates irrelevant partitions before split generation. The DefaultHoodieSplitProvider converts relevant file groups into HoodieSourceSplits for parallel reading.
Key considerations:
- Partition pruning significantly reduces I/O for partitioned tables
- Column statistics index enables file-level pruning within partitions
- Record-level index can accelerate point lookups by specific record keys
- The ColumnStatsProbe evaluates min/max statistics to skip irrelevant files
Step 4: Split Enumeration and Assignment
The HoodieSplitEnumerator distributes discovered splits across reader subtasks. For batch mode, the HoodieStaticSplitEnumerator assigns all splits upfront. For streaming mode, the HoodieContinuousSplitEnumerator periodically discovers new splits as commits arrive. The split assigner balances work across available reader tasks.
Key considerations:
- Static enumeration assigns all splits at job start for batch processing
- Continuous enumeration polls the Hudi timeline at configured intervals
- Split assignment strategies include ordered, bucket-aware, and number-based balancing
- The enumerator state is checkpointed for exactly-once guarantees
Step 5: Read and Process Data
Reader tasks consume their assigned splits using the HoodieSourceSplitReader. For COW tables, this reads Parquet files directly using vectorized column readers (ParquetColumnarRowSplitReader). For MOR tables, base file records are merged with log file records using the HoodieFileGroupReader. The HoodieRecordEmitter converts internal records to Flink RowData.
Key considerations:
- Vectorized Parquet reading provides high-throughput columnar access
- MOR merge applies log records on top of base file snapshot
- Schema projection pushes down selected columns to the reader level
- Each Flink version has its own Parquet reader implementation due to evolving vector APIs
Step 6: Deliver Results
Processed records are emitted downstream in the Flink pipeline for further transformation, aggregation, or output. For streaming reads, the StreamReadOperator manages the lifecycle of split processing and ensures ordered delivery when configured. Monitor read metrics to verify throughput and latency.
Key considerations:
- Records maintain Hudi metadata fields when configured (commit time, record key)
- Streaming reads deliver changes in commit-time order by default
- Backpressure from downstream operators is handled by the split reader framework
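A typical end-to-end delivery pattern streams changes from the Hudi table into a downstream sink, carrying the Hudi metadata fields along. The sink table `order_audit` is hypothetical, and selecting `_hoodie_commit_time` / `_hoodie_record_key` assumes those metadata columns are declared in the source table's Flink schema.

```sql
-- Continuous pipeline: emit each change with its commit time and record key.
INSERT INTO order_audit
SELECT
  `_hoodie_commit_time`,  -- commit that produced the record
  `_hoodie_record_key`,   -- Hudi record key
  order_id,
  amount
FROM hudi_orders /*+ OPTIONS(
  'read.streaming.enabled' = 'true',
  'read.start-commit'      = 'earliest'
) */;
```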