Principle:Apache Hudi Schema Change Planning
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Schema_Management |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Before any column alteration is applied to a data lake table, the system must validate that the proposed type promotion or rename is safe for all existing data.
Description
Schema Change Planning is the preparatory phase of schema evolution in which a system examines the current schema and the desired target schema to determine which column types need promotion, which columns have been renamed, and whether every proposed change is permissible. In Apache Hudi's internal schema model, each column carries a stable integer field ID that survives renames, so the system can detect renames by comparing names at the same ID across two schema versions. Type promotions are governed by a type lattice -- a directed graph of allowed widening conversions (for example, INT to LONG, LONG to DOUBLE, DECIMAL to STRING). Any promotion not present in the lattice is rejected before the change reaches storage.
The planning step produces three artefacts:
- A boolean validation for every (source type, destination type) pair that answers the question "Is this promotion safe?"
- A type-changed columns map that records, for each affected top-level field position, the pair (new type, old type).
- A rename columns map that records, for each renamed field, the mapping from its new fully-qualified name to the last segment of its old name.
These artefacts are consumed by downstream execution and read-reconciliation layers. If any validation fails, the ALTER TABLE statement is rejected at the catalog level, preventing corrupt data from being written.
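The three artefacts and the rejection rule can be pictured as one small container. The following is a minimal sketch, not Hudi's actual data structure: the class name `SchemaChangePlan` and its methods are hypothetical, and types are represented as plain strings for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SchemaChangePlan:
    """Hypothetical container for the three planning artefacts."""

    # (source type, destination type) -> is this promotion safe?
    validations: dict[tuple[str, str], bool] = field(default_factory=dict)
    # top-level field position -> (new type, old type)
    type_changed_cols: dict[int, tuple[str, str]] = field(default_factory=dict)
    # new fully-qualified name -> last segment of the old name
    renamed_cols: dict[str, str] = field(default_factory=dict)

    def rejected(self) -> list[tuple[str, str]]:
        """Return every (source, destination) pair that failed validation."""
        return [pair for pair, ok in self.validations.items() if not ok]

    def ensure_valid(self) -> None:
        """Raise before anything is written if any promotion was refused."""
        bad = self.rejected()
        if bad:
            raise ValueError(f"schema change rejected: {bad}")
```

A caller would fill in the three maps during planning and call `ensure_valid()` before handing the plan to the execution layer, so that a single failed validation stops the whole statement.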
Usage
Apply Schema Change Planning whenever:
- A user issues an ALTER TABLE ... ALTER COLUMN statement that changes a column's data type.
- A user issues an ALTER TABLE ... RENAME COLUMN statement.
- An automated pipeline introduces schema drift that must be reconciled before writing.
- A compaction or clustering job must merge files written under different schema versions.
Theoretical Basis
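Schema evolution theory defines a type lattice as a partial order over primitive types in which an edge from type A to type B means that every value representable in A is also representable in B (possibly through a well-defined conversion). The set of conversions that Apache Hudi allows is broader than a strict partial order: it also includes String to date and String to Decimal, which form cycles with the reverse edges and depend on the string contents being parseable as the target type. The allowed conversions are: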
    int     --> long, float, double, String, Decimal
    long    --> float, double, String, Decimal
    float   --> double, String, Decimal
    double  --> String, Decimal
    Decimal --> String
    Decimal --> Decimal (if precision/scale are compatible)
    String  --> date, Decimal
    date    --> String
    binary  --> String
Pseudocode for type validation:
    function isTypeUpdateAllow(src, dst):
        if src or dst is nested type:
            raise error  // only primitive types
        if src == dst:
            return true
        return lattice.hasEdge(src, dst)
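The validation pseudocode can be made concrete as follows. This is a minimal sketch rather than Hudi's implementation: types are modelled as plain strings plus a hypothetical `DecimalType` class, the edge table reads the chain int, long, float, double, String transitively, and the rule for "compatible" decimals (no integer or fractional digits may be lost) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecimalType:
    """Hypothetical stand-in for a decimal type with precision and scale."""
    precision: int
    scale: int


NESTED_KINDS = {"record", "array", "map"}

# Allowed promotions keyed by source kind (assumed reading of the lattice).
ALLOWED = {
    "int": {"long", "float", "double", "string", "decimal"},
    "long": {"float", "double", "string", "decimal"},
    "float": {"double", "string", "decimal"},
    "double": {"string", "decimal"},
    "decimal": {"string", "decimal"},
    "string": {"date", "decimal"},
    "date": {"string"},
    "binary": {"string"},
}


def kind(t):
    """Map a type to its lattice node: 'decimal' for DecimalType, else its name."""
    return "decimal" if isinstance(t, DecimalType) else t


def decimal_compatible(src, dst):
    # Assumption: the destination keeps every integer and fractional digit.
    return (dst.scale >= src.scale
            and dst.precision - dst.scale >= src.precision - src.scale)


def is_type_update_allowed(src, dst):
    if kind(src) in NESTED_KINDS or kind(dst) in NESTED_KINDS:
        raise ValueError("only primitive types can be updated")
    if src == dst:
        return True
    if kind(dst) not in ALLOWED.get(kind(src), set()):
        return False
    if isinstance(src, DecimalType) and isinstance(dst, DecimalType):
        return decimal_compatible(src, dst)
    return True


print(is_type_update_allowed("int", "long"))   # True
print(is_type_update_allowed("long", "int"))   # False
```

Keeping the edges in a table rather than in branching logic makes the allowed set easy to audit against the lattice, at the cost of handling the parameterised decimal case separately.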
Pseudocode for collecting type-changed columns:
    function collectTypeChangedCols(newSchema, oldSchema):
        result = {}
        for each fieldId in intersection(newSchema.ids, oldSchema.ids):
            if type(newSchema, fieldId) != type(oldSchema, fieldId):
                parentName = topLevelParent(fieldId, newSchema)
                position = indexOf(parentName, newSchema.topFields)
                result[position] = (newSchema.type(parentName), oldSchema.type(parentName))
        return result
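A runnable sketch of the same collection step is shown below. The `Field` and `Schema` classes are hypothetical simplifications (a flat list of fields with dotted full names and string types), not Hudi's internal schema classes. One deliberate difference from the pseudocode: the old parent type is looked up by the parent's field ID rather than by name, so the lookup still works if the parent column was also renamed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    id: int          # stable field ID
    full_name: str   # dotted path, e.g. "address.zip"
    type: str        # type rendered as a string for simplicity


class Schema:
    """Hypothetical flat schema model indexed by ID and by full name."""

    def __init__(self, fields):
        self.by_id = {f.id: f for f in fields}
        self.by_name = {f.full_name: f for f in fields}
        # Top-level fields in declaration order.
        self.top_fields = [f for f in fields if "." not in f.full_name]


def collect_type_changed_cols(new, old):
    top_ids = [f.id for f in new.top_fields]
    result = {}
    for fid in sorted(new.by_id.keys() & old.by_id.keys()):
        if new.by_id[fid].type == old.by_id[fid].type:
            continue
        # Resolve the top-level ancestor of the changed field.
        parent = new.by_name[new.by_id[fid].full_name.split(".")[0]]
        old_parent = old.by_id.get(parent.id)
        if old_parent is None:
            continue  # parent did not exist in the old schema
        position = top_ids.index(parent.id)
        # Record each top-level position once: (new type, old type).
        result.setdefault(position, (parent.type, old_parent.type))
    return result


old = Schema([Field(1, "id", "int"), Field(2, "price", "float")])
new = Schema([Field(1, "id", "long"), Field(2, "price", "float")])
print(collect_type_changed_cols(new, old))  # {0: ('long', 'int')}
```

Because the result is keyed by top-level position, a change to a nested field is reported against its enclosing top-level column, which matches the description of the type-changed columns map.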
Pseudocode for collecting renamed columns:
    function collectRenameCols(oldSchema, newSchema):
        result = {}
        for each colName in oldSchema.allFullNames:
            fieldId = oldSchema.findId(colName)
            if fieldId in newSchema.ids AND newSchema.fullName(fieldId) != colName:
                result[newSchema.fullName(fieldId)] = lastPart(colName)
        return result
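The rename detection can be sketched in a few lines. Here each schema is modelled, as an assumption for illustration, as a plain dictionary from fully-qualified column name to field ID; the function name `collect_rename_cols` mirrors the pseudocode rather than any confirmed API.

```python
def collect_rename_cols(old_schema, new_schema):
    """Map each renamed column's new full name to the last part of its old name.

    Both arguments are dicts of {full_name: field_id} (a hypothetical model).
    """
    new_name_by_id = {fid: name for name, fid in new_schema.items()}
    result = {}
    for old_name, fid in old_schema.items():
        new_name = new_name_by_id.get(fid)
        # Same ID present in both versions but under a different name: a rename.
        if new_name is not None and new_name != old_name:
            result[new_name] = old_name.rsplit(".", 1)[-1]
    return result


old = {"id": 1, "address": 2, "address.zip": 3}
new = {"id": 1, "address": 2, "address.postcode": 3}
print(collect_rename_cols(old, new))  # {'address.postcode': 'zip'}
```

Note that under this logic, renaming a parent column changes the full names of its children too, so those children also appear in the map with their unchanged last segment as the value.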
The key invariant is that field IDs are immutable across schema versions. A field may change its name or its type, but its ID remains constant. This allows the system to detect renames (same ID, different name) and type changes (same ID, different type) without ambiguity.