Workflow: Apache Hudi Flink Streaming Write
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Stream_Processing, Data_Lake |
| Last Updated | 2026-02-08 20:00 GMT |
Overview
End-to-end process for ingesting streaming data into Apache Hudi tables using Apache Flink, supporting both Copy-on-Write (COW) and Merge-on-Read (MOR) table types.
Description
This workflow covers the complete pipeline for writing streaming data into Hudi tables through the Flink datasource integration. It includes configuring the Flink execution environment, defining the Hudi table schema via Flink SQL DDL or the DataStream API, choosing between COW and MOR table types, configuring write operations (insert, upsert, bulk insert), and coordinating distributed writes through the StreamWriteOperatorCoordinator. The write path handles record deduplication, bucket assignment for file group routing, and atomic commits with rollback support.
Usage
Execute this workflow when you have a continuous stream of data (e.g., from Kafka, database CDC, or other Flink sources) that needs to be ingested into a Hudi table on a data lake. Choose COW when you need fast read performance and can tolerate slower writes, or MOR when you need faster writes and can tolerate slightly slower reads (before compaction).
Execution Steps
Step 1: Configure Flink Execution Environment
Set up the Flink streaming execution environment with appropriate checkpoint intervals and parallelism. Checkpointing is essential for Hudi writes as each checkpoint triggers a Hudi commit. Configure the checkpoint interval to control commit frequency and the state backend for fault tolerance.
Key considerations:
- Checkpoint interval directly determines Hudi commit frequency
- Enable exactly-once semantics for data consistency
- Configure sufficient parallelism to match the expected write throughput
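For a SQL-based pipeline, these settings can be applied in the Flink SQL client session before submitting the job. A minimal sketch, assuming a recent Flink version (the interval, state backend, and parallelism values below are illustrative, not recommendations; exact option keys can vary across Flink releases):

```sql
-- Each completed checkpoint triggers one Hudi commit, so this interval
-- is effectively the commit frequency.
SET 'execution.checkpointing.interval' = '60s';

-- Exactly-once checkpointing for end-to-end consistency.
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';

-- Durable state backend for fault tolerance (illustrative choice).
SET 'state.backend' = 'rocksdb';

-- Default parallelism sized to the expected write throughput.
SET 'parallelism.default' = '4';
```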
Step 2: Define Hudi Table Schema
Create the Hudi table using Flink SQL DDL or programmatically via the HoodieTableFactory. Define the table schema including the record key, precombine field, and partition path. Specify the table type (COPY_ON_WRITE or MERGE_ON_READ) and connector properties.
Key considerations:
- Record key uniquely identifies each row for upsert operations
- Precombine field determines which record wins during deduplication
- Partition path controls how data is organized on storage
- The HoodieTableFactory validates configuration and creates the appropriate sink
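A Flink SQL DDL sketch for such a table (table name, schema, and path are hypothetical; the `WITH` option keys follow the Hudi Flink connector but may differ slightly between Hudi releases):

```sql
CREATE TABLE hudi_orders (
  order_id  STRING,
  amount    DOUBLE,
  ts        TIMESTAMP(3),
  dt        STRING,
  -- The primary key serves as the Hudi record key for upserts.
  PRIMARY KEY (order_id) NOT ENFORCED
) PARTITIONED BY (dt)          -- partition path field
WITH (
  'connector'  = 'hudi',
  'path'       = 'file:///tmp/hudi/hudi_orders',   -- illustrative path
  'table.type' = 'MERGE_ON_READ',                  -- or 'COPY_ON_WRITE'
  'precombine.field' = 'ts'    -- record with the largest ts wins on dedup
);
```

The `PARTITIONED BY` clause maps directly to the Hudi partition path, and the declared primary key is what the `HoodieTableFactory` uses as the record key.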
Step 3: Configure Write Operation
Select and configure the write operation type: INSERT, UPSERT, BULK_INSERT, or DELETE. Each operation has different performance characteristics and use cases. Configure index type (FLINK_STATE, BUCKET, or RECORD_INDEX) to control how records are routed to file groups.
Key considerations:
- UPSERT uses the index to route updates to existing file groups
- INSERT skips the index lookup for append-only workloads
- BULK_INSERT optimizes for large initial data loads with sort-based partitioning
- Bucket index provides deterministic routing without state overhead
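These choices are expressed as additional options in the table's `WITH` clause. A hedged fragment (values are illustrative, and option keys may vary across Hudi versions):

```sql
  -- Write operation: 'upsert' (default), 'insert', or 'bulk_insert'.
  'write.operation' = 'upsert',

  -- Index type controlling file-group routing; 'BUCKET' gives
  -- deterministic routing without Flink state overhead.
  'index.type' = 'BUCKET',

  -- Parallelism of the write operators (assumed sizing).
  'write.tasks' = '4'
```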
Step 4: Configure Bucket Assignment and Partitioning
Configure how incoming records are assigned to file groups. The BucketAssigner determines which file group receives each record, packing records into undersized file groups to mitigate small files and creating new file groups when needed. For bucket-based writes, configure the number of buckets per partition.
Key considerations:
- Small file handling packs records into undersized file groups first
- Bucket count affects write parallelism and read performance
- Consistent hashing bucket index supports dynamic bucket resizing
- The MinibatchBucketAssignFunction batches records for efficient assignment
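For a bucket-index table, the bucket count and hashing engine are also table options. A sketch of the relevant `WITH` entries (the bucket count of 16 is an arbitrary example; sizing should reflect expected data volume per partition):

```sql
  'index.type' = 'BUCKET',

  -- Number of buckets per partition; fixed for the simple engine.
  'hoodie.bucket.index.num.buckets' = '16',

  -- 'SIMPLE' uses a fixed bucket count; 'CONSISTENT_HASHING'
  -- supports dynamic bucket resizing.
  'hoodie.index.bucket.engine' = 'CONSISTENT_HASHING'
```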
Step 5: Execute Streaming Write Pipeline
Start the Flink job to begin streaming writes. The StreamWriteOperatorCoordinator manages the distributed write process: each write task buffers incoming records, flushes them on checkpoint, and reports write status to the coordinator. The coordinator aggregates results and performs the atomic Hudi commit.
Key considerations:
- Write tasks operate independently and buffer data between checkpoints
- The coordinator ensures all tasks complete before committing
- Failed writes trigger automatic rollback of the inflight commit
- Hive sync can be enabled to automatically update the Hive Metastore after each commit
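Wiring a source to the Hudi sink and submitting the `INSERT` starts the continuous write job. A minimal sketch, assuming a JSON Kafka topic and a Hudi sink table named `hudi_orders` with a matching schema (the topic, broker address, and table names are all hypothetical):

```sql
CREATE TABLE kafka_orders (
  order_id STRING,
  amount   DOUBLE,
  ts       TIMESTAMP(3),
  dt       STRING
) WITH (
  'connector' = 'kafka',
  'topic'     = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format'    = 'json',
  'scan.startup.mode' = 'earliest-offset'
);

-- This submits a continuous streaming job; each checkpoint flushes the
-- buffered records and the coordinator performs the atomic Hudi commit.
INSERT INTO hudi_orders
SELECT order_id, amount, ts, dt FROM kafka_orders;
```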
Step 6: Verify Write Results
Once the pipeline is running, verify that data is being written correctly by checking the Hudi timeline for successful commits, querying the table to validate data integrity, and monitoring write metrics via the Grafana dashboard or Flink metrics.
Key considerations:
- Check the .hoodie timeline folder for instant metadata
- Verify record counts match expected throughput
- Monitor for failed or rolled-back commits
- Validate partition structure matches expectations
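Simple spot-checks can be run from the same Flink SQL session (the table name is the hypothetical one used above; commit instant files live under `.hoodie` in the table's base path):

```sql
-- Snapshot row count; compare against expected source throughput.
SELECT count(*) FROM hudi_orders;

-- Per-partition counts to validate the partition structure.
SELECT dt, count(*) FROM hudi_orders GROUP BY dt;
```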