
Principle:Apache Paimon Blob Schema Definition

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Blob_Storage
Last Updated 2026-02-07 00:00 GMT

Overview

A mechanism for defining table schemas that support large binary object (blob) storage through descriptor-based references.

Description

Blob schema definition configures a Paimon table to store references to large binary objects (images, videos, documents) rather than the raw data itself. The blob column uses PyArrow's large_binary() type and requires specific table options:

  • blob-field -- the column name containing blob data
  • blob-as-descriptor -- must be set to true
  • row-tracking.enabled -- must be set to true
  • data-evolution.enabled -- must be set to true

Primary keys are not allowed with blob columns. This descriptor-based approach separates metadata storage from blob data, enabling efficient metadata queries without loading large binary files.
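The options listed above can be written down as a plain mapping. This is a minimal sketch, not Paimon's API: the column name `picture` is hypothetical, the values are strings on the assumption that table options travel as a string-to-string map, and how the mapping is handed to table creation depends on the client library and is not shown.

```python
# Hypothetical column name; replace with the table's actual blob column.
BLOB_COLUMN = "picture"

# Table options named in the Description. Values are strings on the
# assumption that options are passed as a string-to-string map.
blob_table_options = {
    "blob-field": BLOB_COLUMN,
    "blob-as-descriptor": "true",
    "row-tracking.enabled": "true",
    "data-evolution.enabled": "true",
}

# Blob tables must not declare primary keys.
primary_keys = []
```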

The schema validation enforces these constraints at table creation time, preventing misconfigured blob tables from being created. When a pa.large_binary() column is detected in the PyArrow schema, the validation logic checks that all required options are present and correctly set, and that no primary keys have been specified.
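The checks described above can be sketched in a few lines. This is an illustration of the rules only, not Paimon's validation code: `validate_blob_schema` is a hypothetical helper, and the type name `"large_binary"` stands in for `pa.large_binary()` so that the sketch needs no third-party imports.

```python
REQUIRED_TRUE_OPTIONS = (
    "blob-as-descriptor",
    "row-tracking.enabled",
    "data-evolution.enabled",
)


def validate_blob_schema(column_types, options, primary_keys):
    """Return a list of constraint violations; an empty list means valid.

    column_types maps column name -> type name. The string "large_binary"
    is an assumption of this sketch, standing in for pa.large_binary().
    """
    blob_columns = [n for n, t in column_types.items() if t == "large_binary"]
    if not blob_columns:
        return []  # no blob column, so the blob rules do not apply

    errors = []
    blob_field = options.get("blob-field")
    if blob_field is None:
        errors.append("missing option: blob-field")
    elif blob_field not in blob_columns:
        errors.append(f"blob-field '{blob_field}' is not a large_binary column")

    # Each of these options must be present and set to true.
    for key in REQUIRED_TRUE_OPTIONS:
        if str(options.get(key, "")).lower() != "true":
            errors.append(f"option {key} must be true")

    if primary_keys:
        errors.append("primary keys are not allowed with blob columns")
    return errors
```

Returning every violation at once, rather than raising on the first, lets a caller report the full list of missing options in one pass; a real implementation may well choose to fail fast instead.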

Usage

Use when designing tables that reference large binary objects stored externally (cloud storage, file systems) where storing the actual data inline would be impractical. This is the foundational step for any blob-enabled Paimon table -- the schema must be correctly configured before any blob descriptors can be written or read.

Typical use cases include:

  • Media asset management (images, videos, audio files)
  • Document storage systems (PDFs, office documents)
  • Scientific data repositories (large datasets, instrument output)
  • Machine learning pipelines (training data, model artifacts)

Theoretical Basis

Blob schema definition follows the descriptor/reference pattern used by object storage systems. Instead of embedding large objects in the data file, the table stores a lightweight descriptor (URI, offset, length). Metadata operations stay cheap because they never touch the blob bytes, and the actual blob content is loaded lazily, on demand.
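The descriptor pattern can be illustrated with a small stand-alone sketch. `BlobDescriptor` here is a hypothetical class written for this example, not Paimon's descriptor type, and the URI is assumed to be a local file path to keep the code self-contained.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobDescriptor:
    """Hypothetical descriptor: records where the bytes live, not the bytes."""
    uri: str      # assumed to be a local file path in this sketch
    offset: int   # byte position where the blob starts
    length: int   # number of bytes in the blob

    def read(self) -> bytes:
        # Content is fetched only when requested (lazy loading).
        with open(self.uri, "rb") as f:
            f.seek(self.offset)
            return f.read(self.length)


def demo() -> bytes:
    # Write a file holding a 6-byte header followed by a 13-byte blob.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"HEADERpayload-bytes")
        descriptor = BlobDescriptor(uri=path, offset=6, length=13)
        return descriptor.read()
    finally:
        os.remove(path)
```

Creating or filtering descriptors costs nothing beyond three small fields; the file is opened only inside `read()`.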

The separation of schema definition from data storage follows the schema-on-write principle: the table structure is validated and enforced at creation time, so all subsequent writes conform to the expected format. By requiring specific options at the schema level, the system ensures that a table is configured for blob descriptor handling before any reader or writer operates on it.

This approach also aligns with the single responsibility principle -- the schema layer is responsible only for defining structure and constraints, while the actual blob storage and retrieval are handled by separate components (writers, readers, and URI readers).

