Principle:Apache Hudi Table Schema Definition

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Stream_Processing
Last Updated 2026-02-08 00:00 GMT

Overview

Hudi Table Schema Definition is the principle of deriving and validating the Avro write schema, table configuration, and connector options from a SQL DDL or Table API declaration before the streaming write pipeline is constructed.

Description

When a user defines a Hudi table through Flink SQL DDL or the Table API, the system must translate the Flink ResolvedSchema (column names, data types, primary keys, partition keys) into a complete set of Hudi write options. This includes:

  • Avro schema inference: If the user does not explicitly supply an Avro schema, the system infers one from the DDL's physical row type. This schema governs serialization and deserialization for all downstream write operations.
  • Primary key and partition key extraction: The DDL's PRIMARY KEY constraint maps to Hudi's record key fields, and the PARTITIONED BY clause maps to Hudi's partition path fields.
  • Table option validation: The system performs sanity checks on the table type (COW vs. MOR), index type, record key presence (required for non-append modes), and ordering fields.
  • Existing table reconciliation: If the table already exists on the filesystem, the factory reads the existing hoodie.properties to reconcile options like table type and record key, preventing configuration conflicts.
  • Key generator selection: Based on the number of primary key and partition fields, the system selects an appropriate key generator (e.g., SimpleAvroKeyGenerator, ComplexAvroKeyGenerator, NonpartitionedAvroKeyGenerator, or TimestampBasedAvroKeyGenerator).

This principle ensures that however the user defines the table, the write pipeline always operates on a consistent, validated configuration.
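The key generator selection rule above can be sketched as a simple cardinality check. This is an illustrative model, not Hudi's actual code: the function name and return values are chosen to mirror the class names listed above, and the sketch omits the timestamp-type inspection that would select TimestampBasedAvroKeyGenerator.

```python
# Illustrative sketch (not Hudi's implementation): pick a key generator
# class name from the number of record key and partition fields,
# mirroring the selection rule described above.

def select_key_generator(record_key_fields, partition_fields):
    """Return the key generator class name for the given key layout."""
    if not partition_fields:
        # No PARTITIONED BY clause: all records live under the table root.
        return "NonpartitionedAvroKeyGenerator"
    if len(record_key_fields) == 1 and len(partition_fields) == 1:
        # One record key field and one partition field.
        return "SimpleAvroKeyGenerator"
    # Composite record key and/or multiple partition fields.
    return "ComplexAvroKeyGenerator"
```

For the DDL example below (single key `id`, single partition `partition_col`), this rule would pick the simple generator.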

Usage

Use this principle whenever a Hudi table is registered via Flink SQL DDL or the Table API:

CREATE TABLE my_hudi_table (
  id BIGINT PRIMARY KEY NOT ENFORCED,
  name STRING,
  ts TIMESTAMP(3),
  partition_col STRING
) PARTITIONED BY (partition_col) WITH (
  'connector' = 'hudi',
  'path' = '/path/to/table',
  'table.type' = 'MERGE_ON_READ'
);

It also applies when using the Table API to programmatically register a Hudi sink.

Theoretical Basis

The schema definition pipeline follows a validate-then-infer pattern:

1. EXTRACT raw options from CatalogTable DDL:
   a. Read connector options map
   b. Read ResolvedSchema (columns, primary key, partition keys)
   c. Resolve ObjectIdentifier (database name, table name)

2. RECONCILE with existing table (if present):
   a. Check for .hoodie/hoodie.properties on filesystem
   b. If exists, read existing table config
   c. Merge existing config with DDL options (existing wins on conflicts for table type)

3. VALIDATE configuration:
   a. Table type must be COPY_ON_WRITE or MERGE_ON_READ
   b. Index type must be valid (FLINK_STATE, BUCKET, GLOBAL_RECORD_LEVEL_INDEX)
   c. Record key must exist in schema (unless append mode)
   d. Ordering fields must exist in schema

4. INFER derived options:
   a. Set table name and database name from ObjectIdentifier
   b. Map PRIMARY KEY to record key fields
   c. Map PARTITIONED BY to partition path fields
   d. Select key generator class based on key/partition cardinality
   e. If no Avro schema supplied, convert LogicalType -> Avro Schema
   f. Set compaction, hive sync, and read/write options

5. CONSTRUCT sink:
   a. Return HoodieTableSink with validated Configuration and ResolvedSchema
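The sanity checks in step 3 can be modeled as a flat pass over the resolved options and column names. This is a hedged sketch: the option keys (`table.type`, `index.type`, `record.key.fields`, `ordering.fields`) and the `append_mode` flag are illustrative stand-ins, not Hudi's exact configuration surface.

```python
# Illustrative sketch (hypothetical option keys, not Hudi's API): the
# validation step applied to an options dict and the DDL's column names.

VALID_TABLE_TYPES = {"COPY_ON_WRITE", "MERGE_ON_READ"}
VALID_INDEX_TYPES = {"FLINK_STATE", "BUCKET", "GLOBAL_RECORD_LEVEL_INDEX"}

def validate_options(options, columns, append_mode=False):
    """Raise ValueError if the table configuration is inconsistent."""
    table_type = options.get("table.type")
    if table_type not in VALID_TABLE_TYPES:
        raise ValueError(f"unsupported table.type: {table_type}")
    index_type = options.get("index.type", "FLINK_STATE")
    if index_type not in VALID_INDEX_TYPES:
        raise ValueError(f"unsupported index.type: {index_type}")
    # Record key fields must exist in the schema unless in append mode.
    if not append_mode:
        for field in options.get("record.key.fields", "").split(","):
            if field and field not in columns:
                raise ValueError(f"record key field not in schema: {field}")
    # Ordering fields must also exist in the schema.
    for field in options.get("ordering.fields", "").split(","):
        if field and field not in columns:
            raise ValueError(f"ordering field not in schema: {field}")
```

Failing fast here, before the sink is constructed, is what keeps misconfigured DDL from producing a half-built write pipeline.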

The Avro schema inference step converts Flink's LogicalType tree into an Avro Schema object using HoodieSchemaConverter.convertToSchema(). The resulting schema string is stored in the configuration under FlinkOptions.SOURCE_AVRO_SCHEMA so that all downstream components share the same schema.
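Conceptually, the conversion walks the Flink type tree and emits the matching Avro types. The sketch below is not HoodieSchemaConverter itself: the type table is an assumption covering only the types in the DDL example above, and the nullable-union convention shown is Avro's standard way of encoding optional fields.

```python
# Illustrative sketch (not HoodieSchemaConverter): map a few Flink
# logical types onto Avro and assemble the record schema as a dict.

FLINK_TO_AVRO = {
    "BIGINT": "long",
    "INT": "int",
    "STRING": "string",
    "DOUBLE": "double",
    "BOOLEAN": "boolean",
    # TIMESTAMP(3) has no primitive Avro equivalent; it is encoded as a
    # long annotated with the timestamp-millis logical type.
    "TIMESTAMP(3)": {"type": "long", "logicalType": "timestamp-millis"},
}

def infer_avro_schema(table_name, columns):
    """Build an Avro record schema dict from (name, flink_type) pairs."""
    return {
        "type": "record",
        "name": table_name,
        "fields": [
            # Nullable columns become a union with "null", per Avro convention.
            {"name": name, "type": ["null", FLINK_TO_AVRO[ftype]]}
            for name, ftype in columns
        ],
    }
```

Serializing the result with `json.dumps` yields the schema string that, in the real pipeline, would be stored under FlinkOptions.SOURCE_AVRO_SCHEMA.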

Related Pages

Implemented By
