Workflow: Apache Hudi Flink Streaming Write
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Stream_Processing, Data_Lake |
| Last Updated | 2026-02-08 20:00 GMT |
Overview
End-to-end process for ingesting streaming data into Apache Hudi tables using Apache Flink, supporting both Copy-on-Write (COW) and Merge-on-Read (MOR) table types.
Description
This workflow covers the complete pipeline for writing streaming data into Hudi tables through the Flink datasource integration. It includes configuring the Flink execution environment, defining the Hudi table schema via Flink SQL DDL or the DataStream API, choosing between COW and MOR table types, configuring write operations (insert, upsert, bulk insert), and coordinating distributed writes through the StreamWriteOperatorCoordinator. The write path handles record deduplication, bucket assignment for file group routing, and atomic commits with rollback support.
Usage
Execute this workflow when you have a continuous stream of data (e.g., from Kafka, database CDC, or other Flink sources) that needs to be ingested into a Hudi table on a data lake. Choose COW when you need fast read performance and can tolerate slower writes, or MOR when you need faster writes and can tolerate slightly slower reads (before compaction).
Execution Steps
Step 1: Configure Flink Execution Environment
Set up the Flink streaming execution environment with appropriate checkpoint intervals and parallelism. Checkpointing is essential for Hudi writes as each checkpoint triggers a Hudi commit. Configure the checkpoint interval to control commit frequency and the state backend for fault tolerance.
Key considerations:
- Checkpoint interval directly determines Hudi commit frequency
- Enable exactly-once semantics for data consistency
- Configure sufficient parallelism to match the expected write throughput
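For a SQL-based pipeline, these settings can be applied in the Flink SQL client session before submitting the job. A minimal sketch, assuming a recent Flink version (the interval, state backend, and parallelism values below are illustrative, not recommendations; exact option keys can vary across Flink releases):

```sql
-- Each completed checkpoint triggers one Hudi commit, so this interval
-- is effectively the commit frequency.
SET 'execution.checkpointing.interval' = '60s';

-- Exactly-once checkpointing for end-to-end consistency.
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';

-- Durable state backend for fault tolerance (illustrative choice).
SET 'state.backend' = 'rocksdb';

-- Default parallelism sized to the expected write throughput.
SET 'parallelism.default' = '4';
```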
Step 2: Define Hudi Table Schema
Create the Hudi table using Flink SQL DDL or programmatically via the HoodieTableFactory. Define the table schema including the record key, precombine field, and partition path. Specify the table type (COPY_ON_WRITE or MERGE_ON_READ) and connector properties.
Key considerations:
- Record key uniquely identifies each row for upsert operations
- Precombine field determines which record wins during deduplication
- Partition path controls how data is organized on storage
- The HoodieTableFactory validates configuration and creates the appropriate sink
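A Flink SQL DDL sketch for such a table (table name, schema, and path are hypothetical; the `WITH` option keys follow the Hudi Flink connector but may differ slightly between Hudi releases):

```sql
CREATE TABLE hudi_orders (
  order_id  STRING,
  amount    DOUBLE,
  ts        TIMESTAMP(3),
  dt        STRING,
  -- The primary key serves as the Hudi record key for upserts.
  PRIMARY KEY (order_id) NOT ENFORCED
) PARTITIONED BY (dt)          -- partition path field
WITH (
  'connector'  = 'hudi',
  'path'       = 'file:///tmp/hudi/hudi_orders',   -- illustrative path
  'table.type' = 'MERGE_ON_READ',                  -- or 'COPY_ON_WRITE'
  'precombine.field' = 'ts'    -- record with the largest ts wins on dedup
);
```

The `PARTITIONED BY` clause maps directly to the Hudi partition path, and the declared primary key is what the `HoodieTableFactory` uses as the record key.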
Step 3: Configure Write Operation
Select and configure the write operation type: INSERT, UPSERT, BULK_INSERT, or DELETE. Each operation has different performance characteristics and use cases. Configure index type (FLINK_STATE, BUCKET, or RECORD_INDEX) to control how records are routed to file groups.
Key considerations:
- UPSERT uses the index to route updates to existing file groups
- INSERT skips the index lookup for append-only workloads
- BULK_INSERT optimizes for large initial data loads with sort-based partitioning
- Bucket index provides deterministic routing without state overhead
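These choices are expressed as additional options in the table's `WITH` clause. A hedged fragment (values are illustrative, and option keys may vary across Hudi versions):

```sql
  -- Write operation: 'upsert' (default), 'insert', or 'bulk_insert'.
  'write.operation' = 'upsert',

  -- Index type controlling file-group routing; 'BUCKET' gives
  -- deterministic routing without Flink state overhead.
  'index.type' = 'BUCKET',

  -- Parallelism of the write operators (assumed sizing).
  'write.tasks' = '4'
```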
Step 4: Configure Bucket Assignment and Partitioning
Configure how incoming records are assigned to file groups. The BucketAssigner determines which file group receives each record, packing records into undersized file groups to mitigate small files and creating new file groups when needed. For bucket-based writes, configure the number of buckets per partition.
Key considerations:
- Small file handling packs records into undersized file groups first
- Bucket count affects write parallelism and read performance
- Consistent hashing bucket index supports dynamic bucket resizing
- The MinibatchBucketAssignFunction batches records for efficient assignment
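For a bucket-index table, the bucket count and hashing engine are also table options. A sketch of the relevant `WITH` entries (the bucket count of 16 is an arbitrary example; sizing should reflect expected data volume per partition):

```sql
  'index.type' = 'BUCKET',

  -- Number of buckets per partition; fixed for the simple engine.
  'hoodie.bucket.index.num.buckets' = '16',

  -- 'SIMPLE' uses a fixed bucket count; 'CONSISTENT_HASHING'
  -- supports dynamic bucket resizing.
  'hoodie.index.bucket.engine' = 'CONSISTENT_HASHING'
```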
Step 5: Execute Streaming Write Pipeline
Start the Flink job to begin streaming writes. The StreamWriteOperatorCoordinator manages the distributed write process: each write task buffers incoming records, flushes them on checkpoint, and reports write status to the coordinator. The coordinator aggregates results and performs the atomic Hudi commit.
Key considerations:
- Write tasks operate independently and buffer data between checkpoints
- The coordinator ensures all tasks complete before committing
- Failed writes trigger automatic rollback of the inflight commit
- Hive sync can be enabled to automatically update the Hive Metastore after each commit
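Wiring a source to the Hudi sink and submitting the `INSERT` starts the continuous write job. A minimal sketch, assuming a JSON Kafka topic and a Hudi sink table named `hudi_orders` with a matching schema (the topic, broker address, and table names are all hypothetical):

```sql
CREATE TABLE kafka_orders (
  order_id STRING,
  amount   DOUBLE,
  ts       TIMESTAMP(3),
  dt       STRING
) WITH (
  'connector' = 'kafka',
  'topic'     = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format'    = 'json',
  'scan.startup.mode' = 'earliest-offset'
);

-- This submits a continuous streaming job; each checkpoint flushes the
-- buffered records and the coordinator performs the atomic Hudi commit.
INSERT INTO hudi_orders
SELECT order_id, amount, ts, dt FROM kafka_orders;
```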
Step 6: Verify Write Results
Once the pipeline is running, verify that data is being written correctly by checking the Hudi timeline for successful commits, querying the table to validate data integrity, and monitoring write metrics via the Grafana dashboard or Flink metrics.
Key considerations:
- Check the .hoodie timeline folder for instant metadata
- Verify record counts match expected throughput
- Monitor for failed or rolled-back commits
- Validate partition structure matches expectations
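Simple spot-checks can be run from the same Flink SQL session (the table name is the hypothetical one used above; commit instant files live under `.hoodie` in the table's base path):

```sql
-- Snapshot row count; compare against expected source throughput.
SELECT count(*) FROM hudi_orders;

-- Per-partition counts to validate the partition structure.
SELECT dt, count(*) FROM hudi_orders GROUP BY dt;
```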