Workflow: Apache Hudi Flink Schema Evolution
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Lake, Schema_Management |
| Last Updated | 2026-02-08 20:00 GMT |
Overview
End-to-end process for evolving the schema of Apache Hudi tables in Flink, supporting column additions, type promotions, and column renames while maintaining backward compatibility with existing data.
Description
This workflow covers schema evolution for Hudi tables accessed through the Flink datasource. Hudi tracks schema changes in its timeline metadata and supports backward-compatible schema modifications including adding new columns, widening column types (e.g., INT to LONG), and renaming columns. The Flink read path handles schema reconciliation automatically by projecting old data through the new schema using SchemaEvolvingRowDataProjection. Schema changes can be performed through Flink SQL ALTER TABLE statements or by writing data with a new schema, with the Hudi writer automatically merging schemas.
Usage
Execute this workflow when you need to modify the schema of an existing Hudi table without rewriting all existing data. Common scenarios include adding new fields to accommodate evolving business requirements, widening numeric types to prevent overflow, renaming columns for clarity, or dropping columns that are no longer needed.
Execution Steps
Step 1: Plan Schema Changes
Determine the required schema modifications and verify they are compatible with Hudi's schema evolution rules. Hudi supports adding nullable columns (with optional defaults), widening numeric types (INT to LONG, FLOAT to DOUBLE), and renaming columns. Breaking changes like narrowing types or removing non-nullable columns require special handling.
Key considerations:
- New columns must be nullable or have default values for backward compatibility
- Type widening follows Avro promotion rules (INT to LONG, FLOAT to DOUBLE)
- Record key and precombine columns cannot be changed
- Partition columns have limited evolution support
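As a planning aid, the intended changes can be written down against the current DDL before touching the table. The sketch below uses a hypothetical `orders` table; the table name, columns, path, and connector options are illustrative, not taken from a real deployment:

```sql
-- Hypothetical existing Hudi table definition in Flink SQL
CREATE TABLE orders (
  order_id BIGINT,
  amount INT,
  ts TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED   -- becomes the Hudi record key
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi/orders',
  'table.type' = 'COPY_ON_WRITE',
  'precombine.field' = 'ts'
);

-- Planned changes (applied in the next step):
--   amount     INT -> BIGINT   legal widening under Avro promotion rules
--   + currency STRING          nullable, so old rows stay readable
--   order_id and ts are the record key and precombine field: leave unchanged
```

Writing the plan this way makes it easy to check each change against the rules above before anything is executed.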
Step 2: Execute Schema Change via SQL
Apply the schema change either through Flink SQL ALTER TABLE statements or by writing data with the evolved schema. When using SQL, the HoodieTableFactory processes the ALTER TABLE command and updates the table schema in the Hudi metadata. When writing with a new schema, the Hudi writer detects the schema difference and merges it with the existing table schema.
Key considerations:
- ALTER TABLE ADD COLUMNS appends new columns to the schema
- ALTER TABLE RENAME COLUMN changes column names in metadata
- Writing with a new schema triggers automatic schema merging
- The schema change is recorded as part of the next commit instant
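The three change types described above might look as follows in Flink SQL, against a hypothetical `orders` table. Note that column-level ADD/MODIFY/RENAME syntax is a Flink 1.17+ feature and must also be supported by the catalog in use:

```sql
-- Add a nullable column (appended to the end of the schema)
ALTER TABLE orders ADD (currency STRING);

-- Widen a numeric type following Avro promotion rules (INT -> BIGINT)
ALTER TABLE orders MODIFY (amount BIGINT);

-- Rename a column (a metadata change; data files are not rewritten)
ALTER TABLE orders RENAME amount TO amount_cents;
```

Each statement is recorded against the table metadata, so the evolved schema takes effect from the next commit instant onward.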
Step 3: Verify Schema Compatibility
After the schema change, verify that the new schema is correctly registered in the Hudi table metadata. Check that existing data can still be read correctly under the new schema, and that new writes conform to the updated schema. The internal schema reconciliation handles mapping between old and new column positions.
Key considerations:
- Query existing data to verify backward compatibility
- New columns should return null for old records (unless defaults are configured)
- Type-widened columns should correctly promote old values
- The Avro schema stored in .hoodie reflects the latest merged schema
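A quick verification pass can be done entirely in SQL. The queries below assume a hypothetical `orders` table to which a nullable `currency` column was added and whose `amount` column was widened to BIGINT:

```sql
-- Confirm the catalog reports the evolved schema
DESCRIBE orders;

-- Old rows should surface NULL for the added column (unless a default
-- was configured), and widened columns should return promoted values
SELECT order_id, amount, currency
FROM orders
LIMIT 10;
```

If old records fail to read, or the added column is missing from DESCRIBE output, the schema change was not correctly committed to the table metadata.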
Step 4: Handle Read Path Reconciliation
When reading data files written with an older schema, the read path automatically reconciles differences. The SchemaEvolvingRowDataProjection maps columns from the file schema to the query schema, handling added columns (filled with nulls or defaults), renamed columns (mapped by field ID), and widened types (automatic promotion).
Key considerations:
- Files written with different schemas coexist in the same table
- Each Parquet file embeds its write schema for self-describing reconciliation
- The FlinkRowDataReaderContext adapts between Hudi's internal reader and Flink types
- Vectorized Parquet reading supports evolved schemas through column index mapping
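This reconciliation is transparent to queries: a plain snapshot read over a table whose files span both schema versions returns uniformly projected rows. Continuing the hypothetical `orders` example:

```sql
-- Snapshot read over files written before and after the evolution.
-- For pre-evolution files, the projection fills `currency` with NULL
-- and promotes old INT `amount` values to BIGINT on the fly.
SELECT order_id, amount, currency FROM orders;
```

No rewrite of old files is needed; the projection happens per file at read time.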
Step 5: Validate End-to-End
Run comprehensive validation to ensure the schema evolution is complete and correct. Write new records with the evolved schema, read back data spanning both old and new schema versions, and verify that all query types (snapshot, incremental, read-optimized) return correct results with the evolved schema.
Key considerations:
- Test with both COW and MOR table types if applicable
- Incremental queries should correctly handle schema changes across commit boundaries
- Catalog sync (Hive Metastore) must be updated to reflect the new schema
- Monitor for any serialization errors in downstream consumers
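A minimal validation pass over a hypothetical evolved `orders` table (assumed schema: `order_id BIGINT, amount BIGINT, ts TIMESTAMP(3), currency STRING`) could look like this; the instant time in the incremental read is illustrative:

```sql
-- 1. Write a record under the evolved schema
INSERT INTO orders
VALUES (1001, CAST(1299 AS BIGINT), TIMESTAMP '2026-02-08 12:00:00', 'EUR');

-- 2. Snapshot read spanning old- and new-schema files
SELECT * FROM orders;

-- 3. Incremental read across the schema-change commit boundary
--    ('read.start-commit' takes a Hudi instant time; value is made up)
SELECT * FROM orders /*+ OPTIONS('read.start-commit' = '20260208120000000') */;
```

If all three return consistent results, and catalog-synced engines see the same schema, the evolution can be considered complete.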