Principle:Apache Paimon Schema Definition and Table Creation
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Table_Format |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for defining table schemas and creating databases and tables within a data lake catalog.
Description
Schema definition and table creation involve specifying column types, partition keys, primary keys, and table options to create a structured table in the catalog. The Schema class wraps a PyArrow schema and adds Paimon-specific metadata. Tables are created atomically via the Catalog interface, which handles both metadata registration and storage initialization. As part of this process, the PyArrow column types are checked for compatibility with the Paimon type system.
Databases serve as logical namespaces that group related tables together. Tables within a database are identified by a fully qualified name (e.g., 'db.table') encapsulated in an Identifier object. The schema defines the physical structure of the table, including column data types, partitioning strategy, primary key constraints, and table-level options such as bucket count.
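The fully qualified naming described above can be sketched in a few lines. This is a minimal illustration only: `TableIdentifier`, `from_string`, and the example name `sales_db.orders` are hypothetical and are not Paimon's own Identifier class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableIdentifier:
    """Hypothetical stand-in for a catalog identifier; not Paimon's own class."""

    database: str
    table: str

    @classmethod
    def from_string(cls, full_name: str) -> "TableIdentifier":
        # Expect exactly one dot separating a non-empty database and table name.
        parts = full_name.split(".")
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"expected 'database.table', got {full_name!r}")
        return cls(database=parts[0], table=parts[1])

    def full_name(self) -> str:
        # Rebuild the fully qualified form used to address the table.
        return f"{self.database}.{self.table}"


ident = TableIdentifier.from_string("sales_db.orders")
print(ident.database, ident.table)
```

Keeping the database and table parts separate lets a catalog resolve the namespace first and the table second, which is the role the database plays as a logical grouping.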
Usage
Use this principle after catalog initialization when setting up new tables for data storage. A table must exist before any read or write operation can be performed on it. The typical workflow involves: (1) creating a database if it does not exist, (2) defining the table schema with column types, partition keys, and primary keys, and (3) creating the table in the catalog with the specified schema.
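The three steps can be sketched as follows. This is a hedged sketch, not a definitive implementation: it assumes the `pypaimon` package exposes `CatalogFactory.create`, `Schema.from_pyarrow_schema`, `create_database`, and `create_table` with the keyword arguments shown, which may differ between versions. The database, table, and column names and the `create_orders_table` helper are hypothetical.

```python
# Hypothetical names used only for illustration.
DATABASE = "sales_db"
TABLE = "orders"
PARTITION_KEYS = ["dt"]
# Assumption: the primary key of a partitioned table includes its partition keys.
PRIMARY_KEYS = ["dt", "order_id"]
TABLE_OPTIONS = {"bucket": "2"}


def create_orders_table(warehouse: str) -> None:
    """Create DATABASE.TABLE in a catalog rooted at `warehouse`."""
    # Imported lazily so this module loads without the optional dependencies.
    import pyarrow as pa
    from pypaimon import CatalogFactory, Schema

    catalog = CatalogFactory.create({"warehouse": warehouse})

    # (1) Create the database if it does not exist.
    catalog.create_database(name=DATABASE, ignore_if_exists=True)

    # (2) Define columns in PyArrow, then attach Paimon-specific metadata.
    pa_schema = pa.schema([
        ("dt", pa.string()),
        ("order_id", pa.int64()),
        ("amount", pa.float64()),
    ])
    schema = Schema.from_pyarrow_schema(
        pa_schema,
        partition_keys=PARTITION_KEYS,
        primary_keys=PRIMARY_KEYS,
        options=TABLE_OPTIONS,
    )

    # (3) Register the table under its fully qualified name.
    catalog.create_table(
        identifier=f"{DATABASE}.{TABLE}",
        schema=schema,
        ignore_if_exists=False,
    )
```

A call such as `create_orders_table("file:///tmp/warehouse")` would run the whole workflow against a local warehouse path, which is itself a placeholder. Passing `ignore_if_exists=False` for the table makes a name collision surface as an error instead of silently reusing an existing table with a possibly different schema.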
Theoretical Basis
Schema definition follows the schema-on-write pattern, in which the table structure is fixed at creation time rather than inferred when data is read. This approach provides several benefits and guarantees:
- Partition pruning: Partition keys enable the query engine to skip entire partitions that do not match a filter predicate, so only the matching partitions are read from storage.
- Primary key constraints: Primary keys enable merge-on-read semantics where updates to existing rows are handled by merging new writes with existing data during reads.
- Type safety: The schema enforces column types at write time, preventing type mismatches from corrupting the data lake.
- Atomic creation: Table creation is atomic. Either the table and all its metadata are fully created, or nothing is changed in the catalog.
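The first two points can be illustrated with a toy model in plain Python. It is a conceptual sketch only: the in-memory `storage` layout, the `read` helper, and the last-write-wins merge rule are assumptions made for illustration and do not reflect Paimon's file format or its configurable merge behavior.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]

# Toy storage: partition value -> list of row batches, oldest first.
# This models the idea only; it is not Paimon's file layout.
storage: Dict[str, List[List[Row]]] = {
    "2024-01-01": [
        [{"dt": "2024-01-01", "order_id": 1, "amount": 10.0}],
        [{"dt": "2024-01-01", "order_id": 1, "amount": 12.5}],  # later update
    ],
    "2024-01-02": [
        [{"dt": "2024-01-02", "order_id": 2, "amount": 7.0}],
    ],
}


def read(dt_filter: str, primary_key: Tuple[str, ...]) -> Tuple[List[Row], int]:
    """Return merged rows for one partition and the number of partitions scanned."""
    scanned = 0
    merged: Dict[tuple, Row] = {}
    for partition, batches in storage.items():
        if partition != dt_filter:
            continue  # partition pruning: skip without touching its batches
        scanned += 1
        for batch in batches:  # oldest to newest
            for row in batch:
                key = tuple(row[col] for col in primary_key)
                merged[key] = row  # merge on read: newest row per key wins
    return list(merged.values()), scanned


rows, scanned = read("2024-01-01", ("dt", "order_id"))
print(rows)     # one row for order 1, carrying the updated amount 12.5
print(scanned)  # 1
```

Only one of the two partitions is scanned, and the two writes for `order_id` 1 collapse into a single row at read time. Both effects depend on the partition and primary keys having been declared when the table was created.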