Principle: Apache Paimon Blob Schema Definition
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Blob_Storage |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for defining table schemas that support large binary object (blob) storage with descriptor-based referencing.
Description
Blob schema definition configures a Paimon table to store references to large binary objects (images, videos, documents) rather than the raw data itself. The blob column uses PyArrow's large_binary() type and requires specific table options:
- blob-field -- the column name containing blob data
- blob-as-descriptor -- must be set to true
- row-tracking.enabled -- must be set to true
- data-evolution.enabled -- must be set to true
Primary keys are not allowed with blob columns. This descriptor-based approach separates metadata storage from blob data, enabling efficient metadata queries without loading large binary files.
The schema validation enforces these constraints at table creation time, preventing misconfigured blob tables from being created. When a pa.large_binary() column is detected in the PyArrow schema, the validation logic checks that all required options are present and correctly set, and that no primary keys have been specified.
Usage
Use when designing tables that reference large binary objects stored externally (cloud storage, file systems) where storing the actual data inline would be impractical. This is the foundational step for any blob-enabled Paimon table -- the schema must be correctly configured before any blob descriptors can be written or read.
Typical use cases include:
- Media asset management (images, videos, audio files)
- Document storage systems (PDFs, office documents)
- Scientific data repositories (large datasets, instrument output)
- Machine learning pipelines (training data, model artifacts)
Theoretical Basis
Follows the descriptor/reference pattern from object storage systems. Instead of embedding large objects in the data file, a lightweight descriptor (URI, offset, length) is stored. This enables efficient metadata operations and lazy loading of actual blob content on demand.
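The descriptor/reference pattern can be illustrated with a simple record type. Paimon's on-disk descriptor encoding is internal and not shown in this document; the `BlobDescriptor` class and the example URI below are purely illustrative of the (URI, offset, length) triple.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlobDescriptor:
    """Lightweight reference to a blob stored outside the data file."""
    uri: str     # location of the object in external storage
    offset: int  # byte offset of the blob within that object
    length: int  # size of the blob in bytes

# Metadata queries only touch this small record; the 1 MiB of
# video bytes are fetched lazily, on demand, from the URI.
desc = BlobDescriptor("s3://bucket/media/video.mp4", 0, 1_048_576)
```

Because the descriptor is tiny and fixed in shape, scans over millions of rows stay cheap even when the referenced blobs total terabytes.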
The separation of schema definition from data storage follows the schema-on-write principle: the table structure is validated and enforced at creation time, ensuring that all subsequent writes conform to the expected format. By requiring specific options at the schema level, the system guarantees that the entire read/write pipeline is correctly configured for blob descriptor handling.
This approach also aligns with the single responsibility principle -- the schema layer is responsible only for defining structure and constraints, while the actual blob storage and retrieval are handled by separate components (writers, readers, and URI readers).