Workflow: Apache Hudi Flink Schema Evolution
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Lake, Schema_Management |
| Last Updated | 2026-02-08 20:00 GMT |
Overview
End-to-end process for evolving the schema of Apache Hudi tables in Flink, supporting column additions, type promotions, and column renames while maintaining backward compatibility with existing data.
Description
This workflow covers schema evolution for Hudi tables accessed through the Flink datasource. Hudi tracks schema changes in its timeline metadata and supports backward-compatible schema modifications including adding new columns, widening column types (e.g., INT to LONG), and renaming columns. The Flink read path handles schema reconciliation automatically by projecting old data through the new schema using SchemaEvolvingRowDataProjection. Schema changes can be performed through Flink SQL ALTER TABLE statements or by writing data with a new schema, with the Hudi writer automatically merging schemas.
Usage
Execute this workflow when you need to modify the schema of an existing Hudi table without rewriting all existing data. Common scenarios include adding new fields to accommodate evolving business requirements, widening numeric types to prevent overflow, renaming columns for clarity, or dropping columns that are no longer needed.
Execution Steps
Step 1: Plan Schema Changes
Determine the required schema modifications and verify they are compatible with Hudi's schema evolution rules. Hudi supports adding nullable columns (with optional defaults), widening numeric types (INT to LONG, FLOAT to DOUBLE), and renaming columns. Breaking changes like narrowing types or removing non-nullable columns require special handling.
Key considerations:
- New columns must be nullable or have default values for backward compatibility
- Type widening follows Avro promotion rules (INT to LONG, FLOAT to DOUBLE)
- Record key and precombine columns cannot be changed
- Partition columns have limited evolution support
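As a planning aid, the intended changes can be written down against the current DDL before touching the table. The sketch below uses a hypothetical `orders` table; the table name, columns, path, and connector options are illustrative, not taken from a real deployment:

```sql
-- Hypothetical existing Hudi table definition in Flink SQL
CREATE TABLE orders (
  order_id BIGINT,
  amount INT,
  ts TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED   -- becomes the Hudi record key
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi/orders',
  'table.type' = 'COPY_ON_WRITE',
  'precombine.field' = 'ts'
);

-- Planned changes (applied in the next step):
--   amount     INT -> BIGINT   legal widening under Avro promotion rules
--   + currency STRING          nullable, so old rows stay readable
--   order_id and ts are the record key and precombine field: leave unchanged
```

Writing the plan this way makes it easy to check each change against the rules above before anything is executed.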
Step 2: Execute Schema Change via SQL
Apply the schema change either through Flink SQL ALTER TABLE statements or by writing data with the evolved schema. When using SQL, the HoodieTableFactory processes the ALTER TABLE command and updates the table schema in the Hudi metadata. When writing with a new schema, the Hudi writer detects the schema difference and merges it with the existing table schema.
Key considerations:
- ALTER TABLE ADD COLUMNS appends new columns to the schema
- ALTER TABLE RENAME COLUMN changes column names in metadata
- Writing with a new schema triggers automatic schema merging
- The schema change is recorded as part of the next commit instant
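The three change types described above might look as follows in Flink SQL, against a hypothetical `orders` table. Note that column-level ADD/MODIFY/RENAME syntax is a Flink 1.17+ feature and must also be supported by the catalog in use:

```sql
-- Add a nullable column (appended to the end of the schema)
ALTER TABLE orders ADD (currency STRING);

-- Widen a numeric type following Avro promotion rules (INT -> BIGINT)
ALTER TABLE orders MODIFY (amount BIGINT);

-- Rename a column (a metadata change; data files are not rewritten)
ALTER TABLE orders RENAME amount TO amount_cents;
```

Each statement is recorded against the table metadata, so the evolved schema takes effect from the next commit instant onward.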
Step 3: Verify Schema Compatibility
After the schema change, verify that the new schema is correctly registered in the Hudi table metadata. Check that existing data can still be read correctly under the new schema, and that new writes conform to the updated schema. The internal schema reconciliation handles mapping between old and new column positions.
Key considerations:
- Query existing data to verify backward compatibility
- New columns should return null for old records (unless defaults are configured)
- Type-widened columns should correctly promote old values
- The Avro schema stored in .hoodie reflects the latest merged schema
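A quick verification pass can be done entirely in SQL. The queries below assume a hypothetical `orders` table to which a nullable `currency` column was added and whose `amount` column was widened to BIGINT:

```sql
-- Confirm the catalog reports the evolved schema
DESCRIBE orders;

-- Old rows should surface NULL for the added column (unless a default
-- was configured), and widened columns should return promoted values
SELECT order_id, amount, currency
FROM orders
LIMIT 10;
```

If old records fail to read, or the added column is missing from DESCRIBE output, the schema change was not correctly committed to the table metadata.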
Step 4: Handle Read Path Reconciliation
When reading data files written with an older schema, the read path automatically reconciles differences. The SchemaEvolvingRowDataProjection maps columns from the file schema to the query schema, handling added columns (filled with nulls or defaults), renamed columns (mapped by field ID), and widened types (automatic promotion).
Key considerations:
- Files written with different schemas coexist in the same table
- Each Parquet file embeds its write schema for self-describing reconciliation
- The FlinkRowDataReaderContext adapts between Hudi's internal reader and Flink types
- Vectorized Parquet reading supports evolved schemas through column index mapping
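This reconciliation is transparent to queries: a plain snapshot read over a table whose files span both schema versions returns uniformly projected rows. Continuing the hypothetical `orders` example:

```sql
-- Snapshot read over files written before and after the evolution.
-- For pre-evolution files, the projection fills `currency` with NULL
-- and promotes old INT `amount` values to BIGINT on the fly.
SELECT order_id, amount, currency FROM orders;
```

No rewrite of old files is needed; the projection happens per file at read time.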
Step 5: Validate End-to-End
Run comprehensive validation to ensure the schema evolution is complete and correct. Write new records with the evolved schema, read back data spanning both old and new schema versions, and verify that all query types (snapshot, incremental, read-optimized) return correct results with the evolved schema.
Key considerations:
- Test with both COW and MOR table types if applicable
- Incremental queries should correctly handle schema changes across commit boundaries
- Catalog sync (Hive Metastore) must be updated to reflect the new schema
- Monitor for any serialization errors in downstream consumers
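A minimal validation pass over a hypothetical evolved `orders` table (assumed schema: `order_id BIGINT, amount BIGINT, ts TIMESTAMP(3), currency STRING`) could look like this; the instant time in the incremental read is illustrative:

```sql
-- 1. Write a record under the evolved schema
INSERT INTO orders
VALUES (1001, CAST(1299 AS BIGINT), TIMESTAMP '2026-02-08 12:00:00', 'EUR');

-- 2. Snapshot read spanning old- and new-schema files
SELECT * FROM orders;

-- 3. Incremental read across the schema-change commit boundary
--    ('read.start-commit' takes a Hudi instant time; value is made up)
SELECT * FROM orders /*+ OPTIONS('read.start-commit' = '20260208120000000') */;
```

If all three return consistent results, and catalog-synced engines see the same schema, the evolution can be considered complete.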