Principle:Apache Hudi Schema Change Planning


Knowledge Sources
Domains: Data_Lake, Schema_Management
Last Updated: 2026-02-08 00:00 GMT

Overview

Before any column alteration is applied to a data lake table, the system must validate that the proposed type promotion or rename is safe for all existing data.

Description

Schema Change Planning is the preparatory phase of schema evolution in which a system examines the current schema and the desired target schema to determine which column types need promotion, which columns have been renamed, and whether every proposed change is permissible. In Apache Hudi's internal schema model, each column carries a stable integer field ID that survives renames, so the system can detect renames by comparing names at the same ID across two schema versions. Type promotions are governed by a type lattice -- a directed graph of allowed widening conversions (for example, INT to LONG, LONG to DOUBLE, DECIMAL to STRING). Any promotion not present in the lattice is rejected before the change reaches storage.

The planning step produces three artefacts:

  1. A boolean validation for every (source type, destination type) pair that answers the question "Is this promotion safe?"
  2. A type-changed columns map that records, for each affected top-level field position, the pair (new type, old type).
  3. A rename columns map that records, for each renamed field, the mapping from its new fully-qualified name to the last segment of its old name.

These artefacts are consumed by downstream execution and read-reconciliation layers. If any validation fails, the ALTER TABLE statement is rejected at the catalog level, preventing corrupt data from being written.
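
The three artefacts can be pictured as one plan object that downstream layers inspect. The following is a minimal Python sketch of that idea, not Hudi's API: the class `SchemaChangePlan`, its field names, and the helper methods are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SchemaChangePlan:
    """Hypothetical container for the three planning artefacts."""
    # (source type, destination type) -> is this promotion safe?
    validations: Dict[Tuple[str, str], bool] = field(default_factory=dict)
    # top-level field position -> (new type, old type)
    type_changed_cols: Dict[int, Tuple[str, str]] = field(default_factory=dict)
    # new fully-qualified name -> last segment of the old name
    rename_cols: Dict[str, str] = field(default_factory=dict)

    def is_safe(self) -> bool:
        """The plan is acceptable only if every validation passed."""
        return all(self.validations.values())

    def rejected_promotions(self) -> List[Tuple[str, str]]:
        """List the (source, destination) pairs that failed validation."""
        return sorted(pair for pair, ok in self.validations.items() if not ok)


# Illustrative data: one allowed promotion and one that is not in the lattice.
plan = SchemaChangePlan(
    validations={("int", "long"): True, ("string", "int"): False},
    type_changed_cols={0: ("long", "int")},
    rename_cols={"address.zip_code": "zip"},
)
```

A caller would check `plan.is_safe()` before letting the change proceed and report `plan.rejected_promotions()` when it returns false.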

Usage

Apply Schema Change Planning whenever:

  • A user issues an ALTER TABLE ... ALTER COLUMN statement that changes a column's data type.
  • A user issues an ALTER TABLE ... RENAME COLUMN statement.
  • An automated pipeline introduces a schema drift that needs to be reconciled before writing.
  • A compaction or clustering job must merge files written under different schema versions.

Theoretical Basis

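Schema evolution theory defines a type lattice as a partial order over primitive types where an edge from type A to type B means every value representable in A is also representable in B (possibly with a well-defined conversion). The edge set used by Apache Hudi follows this idea for numeric widening, but it also contains edges in both directions between String and date and between String and Decimal, so it is more accurately a directed graph of allowed conversions than a strict partial order: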

int     --> long --> float --> double --> String
int     --> float
int     --> Decimal
long    --> Decimal
float   --> Decimal
double  --> Decimal
Decimal --> Decimal (if precision/scale are compatible)
Decimal --> String
String  --> date
String  --> Decimal
date    --> String
binary  --> String

Pseudocode for type validation:

function isTypeUpdateAllow(src, dst):
    if src or dst is nested type:
        raise error  // only primitive types
    if src == dst:
        return true
    return lattice.hasEdge(src, dst)
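
The validation pseudocode can be made concrete with a small adjacency table. The Python sketch below rests on two assumptions that the page does not spell out: the numeric chain is expanded transitively (so long to double is a direct entry, matching the example in the Description), and Decimal-to-Decimal compatibility is read as "no integer or fractional digits are lost". The names `DecimalType`, `ALLOWED`, and `decimal_compatible` are illustrative, not Hudi identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecimalType:
    precision: int
    scale: int


# Nested type names used by this sketch (hypothetical labels).
NESTED = {"record", "array", "map"}

# Allowed promotions, keyed by source type (assumed transitive expansion).
ALLOWED = {
    "int": {"long", "float", "double", "string", "decimal"},
    "long": {"float", "double", "string", "decimal"},
    "float": {"double", "string", "decimal"},
    "double": {"string", "decimal"},
    "decimal": {"string"},
    "string": {"date", "decimal"},
    "date": {"string"},
    "binary": {"string"},
}


def type_id(t):
    """Collapse any DecimalType to the lattice key 'decimal'."""
    return "decimal" if isinstance(t, DecimalType) else t


def decimal_compatible(src, dst):
    """Assumed rule: dst keeps every integer and fractional digit of src."""
    src_int = src.precision - src.scale
    dst_int = dst.precision - dst.scale
    return dst.scale >= src.scale and dst_int >= src_int


def is_type_update_allowed(src, dst):
    if type_id(src) in NESTED or type_id(dst) in NESTED:
        raise ValueError("only primitive types can be updated")
    if src == dst:
        return True
    if isinstance(src, DecimalType) and isinstance(dst, DecimalType):
        return decimal_compatible(src, dst)
    return type_id(dst) in ALLOWED.get(type_id(src), set())
```

Keeping the table explicit, rather than computing reachability over the graph, avoids accidentally allowing paths such as int to date that only exist by chaining through String.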

Pseudocode for collecting type-changed columns:

function collectTypeChangedCols(newSchema, oldSchema):
    result = {}
    for each fieldId in intersection(newSchema.ids, oldSchema.ids):
        if type(newSchema, fieldId) != type(oldSchema, fieldId):
            // resolve the top-level parent in each schema separately,
            // because the parent may have been renamed between versions
            newParent = topLevelParent(fieldId, newSchema)
            oldParent = topLevelParent(fieldId, oldSchema)
            position = indexOf(newParent, newSchema.topFields)
            result[position] = (newSchema.type(newParent), oldSchema.type(oldParent))
    return result
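
As a runnable illustration of the same idea, the Python sketch below models a schema as a plain dict from field ID to (full dotted name, type string). This representation, and the convention of writing a record type as a string such as `record<zip:int>`, are assumptions made for the example and are not Hudi's internal schema classes.

```python
def top_level_name(full_name):
    """Return the first segment of a dotted field name."""
    return full_name.split(".")[0]


def collect_type_changed_cols(new_schema, old_schema):
    """Map top-level position -> (new type, old type) for changed fields.

    Each schema is a dict: field ID -> (full dotted name, type string).
    """
    new_by_name = {name: typ for name, typ in new_schema.values()}
    old_by_name = {name: typ for name, typ in old_schema.values()}
    # Top-level fields in declaration order (dicts keep insertion order).
    top_fields = [name for name, _ in new_schema.values() if "." not in name]

    result = {}
    for field_id in new_schema.keys() & old_schema.keys():
        new_name, new_type = new_schema[field_id]
        old_name, old_type = old_schema[field_id]
        if new_type != old_type:
            new_parent = top_level_name(new_name)
            old_parent = top_level_name(old_name)
            position = top_fields.index(new_parent)
            # Record the type of the whole top-level column, not the leaf.
            result.setdefault(
                position, (new_by_name[new_parent], old_by_name[old_parent])
            )
    return result


old = {
    1: ("id", "int"),
    2: ("address", "record<zip:int>"),
    3: ("address.zip", "int"),
}
new = {
    1: ("id", "long"),
    2: ("address", "record<zip:string>"),
    3: ("address.zip", "string"),
}
changed = collect_type_changed_cols(new, old)
```

Here a change to the nested field `address.zip` is reported against position 1, the top-level `address` column, which is how the map stays keyed by top-level field position.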

Pseudocode for collecting renamed columns:

function collectRenameCols(oldSchema, newSchema):
    result = {}
    for each colName in oldSchema.allFullNames:
        fieldId = oldSchema.findId(colName)
        if fieldId in newSchema.ids AND newSchema.fullName(fieldId) != colName:
            result[newSchema.fullName(fieldId)] = lastPart(colName)
    return result
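
The rename collection can be sketched in Python in the same spirit. The schema representation (a dict from field ID to full dotted name) is an assumption for the example; the sketch iterates over IDs directly instead of looking each ID up by name, which yields the same mapping.

```python
def collect_rename_cols(old_schema, new_schema):
    """Map new fully-qualified name -> last segment of the old name.

    Each schema is a dict: field ID -> full dotted name.
    """
    result = {}
    for field_id, old_name in old_schema.items():
        new_name = new_schema.get(field_id)
        # Same ID present in both versions under a different name: a rename.
        if new_name is not None and new_name != old_name:
            result[new_name] = old_name.rsplit(".", 1)[-1]
    return result


old = {1: "id", 2: "address", 3: "address.zip", 4: "legacy"}
new = {1: "id", 2: "address", 3: "address.zip_code", 5: "added"}
renames = collect_rename_cols(old, new)
```

Field 4 exists only in the old schema and field 5 only in the new one, so neither is treated as a rename; only field 3 appears in the result.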

The key invariant is that field IDs are immutable across schema versions. A field may change its name or its type, but its ID remains constant. This allows the system to detect renames (same ID, different name) and type changes (same ID, different type) without ambiguity.
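
A short Python comparison shows why the invariant matters. Both helper functions and the sample schemas are hypothetical; a schema is modelled as a dict from field ID to (name, type).

```python
old = {1: ("price", "int"), 2: ("qty", "int")}
new = {1: ("amount", "long"), 2: ("qty", "int")}


def diff_by_name(old_schema, new_schema):
    """Name-based comparison: a rename looks like a drop plus an add."""
    old_names = {name for name, _ in old_schema.values()}
    new_names = {name for name, _ in new_schema.values()}
    return {
        "dropped": sorted(old_names - new_names),
        "added": sorted(new_names - old_names),
    }


def diff_by_id(old_schema, new_schema):
    """ID-based comparison: the same field is tracked across versions."""
    changes = {}
    for field_id in sorted(old_schema.keys() & new_schema.keys()):
        old_name, old_type = old_schema[field_id]
        new_name, new_type = new_schema[field_id]
        kinds = []
        if old_name != new_name:
            kinds.append("renamed")
        if old_type != new_type:
            kinds.append("type changed")
        if kinds:
            changes[field_id] = kinds
    return changes


by_name = diff_by_name(old, new)
by_id = diff_by_id(old, new)
```

The name-based view cannot tell that `amount` holds the data formerly stored under `price`, while the ID-based view reports a single field that was both renamed and promoted.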
