Principle: Apache Hudi Schema Compatibility Verification
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Schema_Management |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Before a writer is allowed to produce records under a new schema, the system must verify that the writer schema is backward-compatible with the existing table schema.
Description
Schema Compatibility Verification is the enforcement layer that prevents incompatible data from entering a Hudi table. While Schema Change Planning validates individual type promotions, compatibility verification operates at the whole-schema level, checking that every field in the writer schema can be decoded by a reader using the table schema and vice versa.
There are two distinct compatibility concerns:
- Backward compatibility -- Can a reader with the table schema decode data written under the writer schema? This is the standard Avro-style reader/writer compatibility check. If a writer adds a field that the reader does not know about, the reader simply ignores it. If a writer removes a required field, the reader will fail.
- Projection compatibility -- Is the writer allowed to omit fields that exist in the table schema? In some modes (such as partial updates), projection is allowed; in strict modes it is not.
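The two concerns can be illustrated with a small self-contained sketch. This is not Hudi or Avro code; it simulates both checks over plain dicts mapping field name to a `(type, has_default)` pair:

```python
# Illustrative sketch (not the Hudi/Avro API): field specs are
# name -> (type, has_default).

def can_read(reader_fields, writer_fields):
    """Backward compatibility: can a reader decode writer-produced records?"""
    for name, (rtype, has_default) in reader_fields.items():
        if name not in writer_fields:
            # Reader field absent from writer data: only OK with a default.
            if not has_default:
                return False
        elif writer_fields[name][0] != rtype:
            return False  # type mismatch (promotions ignored in this sketch)
    # Extra writer fields are simply ignored by the reader.
    return True

def is_projection_allowed(table_fields, writer_fields, allow_projection):
    """Projection compatibility: may the writer omit table-schema fields?"""
    omitted = set(table_fields) - set(writer_fields)
    return allow_projection or not omitted

table = {"id": ("long", False), "name": ("string", True)}
writer_adds = {"id": ("long", False), "name": ("string", True), "extra": ("int", False)}
writer_omits = {"id": ("long", False)}

print(can_read(table, writer_adds))    # True: reader ignores the added field
print(can_read(table, writer_omits))   # True: omitted field has a default
print(is_projection_allowed(table, writer_omits, allow_projection=False))  # False
```

Note that `writer_omits` passes the backward-compatibility check (the omitted field has a default) yet fails the projection check in strict mode: the two concerns are genuinely independent.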
The verification proceeds in two steps:
- Missing field check -- If projection is not allowed, the system computes the set of table schema fields that are absent from the writer schema (excluding partition columns). If any are missing, a MissingSchemaFieldException is thrown.
- Reader/writer compatibility check -- If validation is enabled and no partition columns are being dropped, the system performs a full reader/writer compatibility analysis. This walks the writer and table schema trees in parallel, checking that every writer field has a corresponding reader field of a compatible type. The result is a SchemaPairCompatibility object that is either COMPATIBLE or INCOMPATIBLE with a descriptive message.
This two-step design allows the system to provide precise error messages: missing fields are reported by name, while type mismatches are reported with full schema context.
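A minimal sketch of the first step, showing how missing fields get reported by name. The exception name mirrors the text, but the code is hypothetical and uses plain field-name lists rather than real schemas:

```python
# Hypothetical sketch of the missing-field check (step 1).

class MissingSchemaFieldException(Exception):
    pass

def find_missing_fields(table_fields, writer_fields, partition_columns):
    """Table-schema fields absent from the writer schema, minus partition columns."""
    return sorted(set(table_fields) - set(writer_fields) - set(partition_columns))

def check_no_projection(table_fields, writer_fields, partition_columns):
    missing = find_missing_fields(table_fields, writer_fields, partition_columns)
    if missing:
        # Missing fields are reported by name, giving a precise error message.
        raise MissingSchemaFieldException(f"Writer schema is missing fields: {missing}")

check_no_projection(["id", "ts", "name", "dt"], ["id", "ts", "name"], ["dt"])  # OK
try:
    check_no_projection(["id", "ts", "name", "dt"], ["id", "ts"], ["dt"])
except MissingSchemaFieldException as e:
    print(e)  # names the missing field: ['name']
```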
Usage
Apply Schema Compatibility Verification whenever:
- A write operation (INSERT, UPSERT, BULK_INSERT) is about to produce records.
- A new writer schema is introduced via a Flink streaming job restart.
- An incoming schema evolution (ALTER TABLE) must be validated against the existing table schema.
- A compaction or clustering job merges files and needs to ensure the output schema is compatible with the table schema.
Theoretical Basis
Schema compatibility is rooted in the theory of subtyping in type systems. A writer schema W is compatible with a reader schema R if W is a subtype of R under the schema's type rules. For record types, this means:
- Every field in R that has no default must exist in W (completeness).
- For every field that exists in both R and W, the type in W must be promotable to the type in R (covariance for read, contravariance for write).
- Additional fields in W that are not in R are safely ignored (structural subtyping).
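The three rules above can be condensed into a small checker. The promotion table here is a simplified subset of Avro's numeric promotions (an assumption for illustration, not the full specification):

```python
# Minimal sketch of the three record-subtyping rules.
# Schemas are dicts: field name -> (type, has_default).

PROMOTABLE = {("int", "long"), ("int", "double"), ("long", "double"), ("float", "double")}

def is_subtype(writer, reader):
    """True if writer schema W is a subtype of reader schema R."""
    for name, (rtype, has_default) in reader.items():
        if name not in writer:
            if not has_default:
                return False            # completeness violated
        else:
            wtype = writer[name][0]
            if wtype != rtype and (wtype, rtype) not in PROMOTABLE:
                return False            # W's type not promotable to R's type
    return True                          # extra writer fields are ignored

reader = {"id": ("long", False), "score": ("double", True)}
writer = {"id": ("int", False), "score": ("float", False), "tag": ("string", False)}
print(is_subtype(writer, reader))  # True: int->long and float->double promote
```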
Pseudocode for schema compatibility checking:
```
function checkSchemaCompatible(tableSchema, writerSchema, shouldValidate,
                               allowProjection, droppedPartitionColumns):
    assert tableSchema != null and writerSchema != null
    // Step 1: Missing field check
    if not allowProjection:
        missingFields = findMissingFields(tableSchema, writerSchema, droppedPartitionColumns)
        if missingFields is not empty:
            throw MissingSchemaFieldException(missingFields)
    // Step 2: Reader/writer compatibility check
    if droppedPartitionColumns is empty and shouldValidate:
        result = checkReaderWriterCompatibility(tableSchema, writerSchema, checkNaming=true)
        if result.type != COMPATIBLE:
            throw SchemaBackwardsCompatibilityException(result)
```
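The orchestration above can be sketched as runnable Python, assuming simple dict-based schemas (field name to type) rather than real Avro schemas; the exception names follow the text, and the compatibility predicate is a placeholder for the full recursive analysis:

```python
# Runnable sketch of the two-step check; dict-based schemas are an assumption.

class MissingSchemaFieldException(Exception): pass
class SchemaBackwardsCompatibilityException(Exception): pass

def check_reader_writer_compatible(reader, writer):
    # Placeholder for the full recursive analysis: every reader field must
    # exist in the writer with an identical type (extra writer fields ignored).
    return all(writer.get(name) == rtype for name, rtype in reader.items())

def check_schema_compatible(table, writer, should_validate, allow_projection,
                            dropped_partition_cols):
    assert table is not None and writer is not None
    # Step 1: missing-field check (skipped when projection is allowed)
    if not allow_projection:
        missing = set(table) - set(writer) - set(dropped_partition_cols)
        if missing:
            raise MissingSchemaFieldException(sorted(missing))
    # Step 2: reader/writer compatibility check
    if not dropped_partition_cols and should_validate:
        if not check_reader_writer_compatible(table, writer):
            raise SchemaBackwardsCompatibilityException(
                "writer schema is incompatible with table schema")

table = {"id": "long", "name": "string"}
check_schema_compatible(table, {"id": "long", "name": "string", "note": "string"},
                        should_validate=True, allow_projection=False,
                        dropped_partition_cols=[])
print("compatible")
```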
Pseudocode for reader/writer compatibility:
```
function checkReaderWriterCompatibility(reader, writer, checkNamingOverride):
    compatibility = ReaderWriterCompatibilityChecker(checkNamingOverride)
                        .getCompatibility(reader, writer)
    message = formatMessage(compatibility, reader, writer)
    return SchemaPairCompatibility(compatibility, reader, writer, message)
```
The checker walks the two schema trees using a stack-based algorithm. For each pair of (readerField, writerField), it verifies type compatibility recursively. Union types, arrays, maps, and nested records are all handled with type-specific rules derived from the Avro specification (with Hudi-specific relaxations such as ignoring schema names outside of unions).
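A simplified recursive walker in the spirit described above. It assumes schemas are nested tuples like `("record", {field: schema})`, `("array", item)`, `("map", value)`, `("union", [branches])`, or a primitive type string; this is an illustrative sketch, not Avro's `ReaderWriterCompatibilityChecker`, and it omits defaults and naming rules:

```python
# Recursive compatibility walk over nested schema trees (illustrative only).

PROMOTIONS = {("int", "long"), ("int", "float"), ("int", "double"),
              ("long", "float"), ("long", "double"), ("float", "double")}

def compatible(reader, writer):
    # Unions: every writer branch must be readable by some reader branch.
    if isinstance(writer, tuple) and writer[0] == "union":
        return all(compatible(reader, b) for b in writer[1])
    if isinstance(reader, tuple) and reader[0] == "union":
        return any(compatible(b, writer) for b in reader[1])
    # Primitives: identical or promotable (writer -> reader).
    if isinstance(reader, str) or isinstance(writer, str):
        return reader == writer or (writer, reader) in PROMOTIONS
    rkind, rbody = reader
    wkind, wbody = writer
    if rkind != wkind:
        return False
    if rkind in ("array", "map"):
        return compatible(rbody, wbody)   # recurse on element/value type
    if rkind == "record":
        # Every reader field must exist in the writer (defaults omitted here)
        # with a recursively compatible type; extra writer fields are ignored.
        return all(name in wbody and compatible(rtype, wbody[name])
                   for name, rtype in rbody.items())
    return False

reader = ("record", {"id": "long", "tags": ("array", "string")})
writer = ("record", {"id": "int", "tags": ("array", "string"), "extra": "double"})
print(compatible(reader, writer))  # True: int promotes to long
```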