Principle: Apache Paimon Blob Schema Definition
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Blob_Storage |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for defining table schemas that support large binary object (blob) storage with descriptor-based referencing.
Description
Blob schema definition configures a Paimon table to store references to large binary objects (images, videos, documents) rather than the raw data itself. The blob column uses PyArrow's large_binary() type and requires specific table options:
- blob-field -- the column name containing blob data
- blob-as-descriptor -- must be set to true
- row-tracking.enabled -- must be set to true
- data-evolution.enabled -- must be set to true
Primary keys are not allowed with blob columns. This descriptor-based approach separates metadata storage from blob data, enabling efficient metadata queries without loading large binary files.
The schema validation enforces these constraints at table creation time, preventing misconfigured blob tables from being created. When a pa.large_binary() column is detected in the PyArrow schema, the validation logic checks that all required options are present and correctly set, and that no primary keys have been specified.
Usage
Use when designing tables that reference large binary objects stored externally (cloud storage, file systems) where storing the actual data inline would be impractical. This is the foundational step for any blob-enabled Paimon table -- the schema must be correctly configured before any blob descriptors can be written or read.
Typical use cases include:
- Media asset management (images, videos, audio files)
- Document storage systems (PDFs, office documents)
- Scientific data repositories (large datasets, instrument output)
- Machine learning pipelines (training data, model artifacts)
Theoretical Basis
Follows the descriptor/reference pattern from object storage systems. Instead of embedding large objects in the data file, a lightweight descriptor (URI, offset, length) is stored. This enables efficient metadata operations and lazy loading of actual blob content on demand.
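The descriptor/reference pattern can be illustrated with a simple record type. Paimon's on-disk descriptor encoding is internal and not shown in this document; the `BlobDescriptor` class and the example URI below are purely illustrative of the (URI, offset, length) triple.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlobDescriptor:
    """Lightweight reference to a blob stored outside the data file."""
    uri: str     # location of the object in external storage
    offset: int  # byte offset of the blob within that object
    length: int  # size of the blob in bytes

# Metadata queries only touch this small record; the 1 MiB of
# video bytes are fetched lazily, on demand, from the URI.
desc = BlobDescriptor("s3://bucket/media/video.mp4", 0, 1_048_576)
```

Because the descriptor is tiny and fixed in shape, scans over millions of rows stay cheap even when the referenced blobs total terabytes.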
The separation of schema definition from data storage follows the schema-on-write principle: the table structure is validated and enforced at creation time, ensuring that all subsequent writes conform to the expected format. By requiring specific options at the schema level, the system guarantees that the entire read/write pipeline is correctly configured for blob descriptor handling.
This approach also aligns with the single responsibility principle -- the schema layer is responsible only for defining structure and constraints, while the actual blob storage and retrieval are handled by separate components (writers, readers, and URI readers).