Principle:Apache Paimon Blob Descriptor Construction

Knowledge Sources	Apache Paimon
Domains	Data_Lake, Blob_Storage
Last Updated	2026-02-07 00:00 GMT

Overview

Mechanism for creating lightweight descriptors that reference external binary objects by URI, offset, and length.

Description

A BlobDescriptor encapsulates a reference to an external binary object with three key attributes:

URI -- the location of the file (e.g., oss://bucket/path/file.mov, s3://bucket/key, or a local filesystem path)
offset -- byte offset within the file where the relevant data begins
length -- number of bytes to read from the offset

Descriptors are serialized into a compact binary format for efficient storage in Paimon table columns. The serialization protocol uses the following layout:

Version (1 byte) -- protocol version for forward compatibility
URI length (4 bytes, little-endian) -- length of the URI string in bytes
URI bytes (variable length) -- UTF-8 encoded URI string
Offset (8 bytes, little-endian) -- byte offset within the file
Length (8 bytes, little-endian) -- number of bytes to read

This compact binary representation minimizes storage overhead per descriptor while retaining all information needed to locate and read the referenced blob. When a descriptor should cover a complete file, the offset is 0 and the length can be obtained from file_io.get_file_size().
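As a rough illustration of the layout above, the following sketch packs and unpacks a descriptor with Python's standard struct module. It is not the Paimon implementation: the names Descriptor, pack_descriptor, and unpack_descriptor are hypothetical, the version value 1 is an assumption (the layout only fixes the field width), and the choice of an unsigned URI length and signed 64-bit offset and length is also an assumption. For non-negative values the signed and unsigned encodings are byte-identical.

```python
import struct
from typing import NamedTuple

VERSION = 1        # assumed value; the layout only says the field is 1 byte
HEADER_SIZE = 5    # 1-byte version + 4-byte URI length


class Descriptor(NamedTuple):
    """Hypothetical stand-in for a blob descriptor: URI, offset, length."""
    uri: str
    offset: int
    length: int


def pack_descriptor(d: Descriptor) -> bytes:
    """Encode: version, URI length, URI bytes, offset, length (little-endian)."""
    uri_bytes = d.uri.encode("utf-8")
    header = struct.pack("<BI", VERSION, len(uri_bytes))  # "<" disables padding
    trailer = struct.pack("<qq", d.offset, d.length)      # two 8-byte integers
    return header + uri_bytes + trailer


def unpack_descriptor(data: bytes) -> Descriptor:
    """Decode the layout, rejecting versions this sketch does not know."""
    version, uri_len = struct.unpack_from("<BI", data, 0)
    if version != VERSION:
        raise ValueError(f"unsupported descriptor version: {version}")
    uri_end = HEADER_SIZE + uri_len
    uri = data[HEADER_SIZE:uri_end].decode("utf-8")
    offset, length = struct.unpack_from("<qq", data, uri_end)
    return Descriptor(uri, offset, length)


original = Descriptor("s3://bucket/key", 0, 1024)
raw = pack_descriptor(original)
restored = unpack_descriptor(raw)
```

The length prefix is what lets the decoder find the end of the URI without scanning for a terminator; everything after it sits at a position computed from that single field.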

Usage

Use when preparing references to external files before writing them to a blob-enabled Paimon table. The typical workflow is:

Determine the URI, offset, and length for each external blob
Create a BlobDescriptor for each blob
Serialize each descriptor using serialize()
Store the serialized bytes in the blob column of a PyArrow table
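The workflow above can be sketched end to end with the standard library alone. This is an illustration under stated assumptions, not Paimon's API: serialize_descriptor is a hypothetical stand-in for BlobDescriptor.serialize() that follows the layout given in the Description, os.path.getsize stands in for file_io.get_file_size(), local temporary files play the role of external blobs, and the column names id and blob are invented for the example.

```python
import os
import struct
import tempfile


def serialize_descriptor(uri: str, offset: int, length: int, version: int = 1) -> bytes:
    """Hypothetical stand-in for BlobDescriptor.serialize() (version 1 assumed)."""
    uri_bytes = uri.encode("utf-8")
    return (
        struct.pack("<BI", version, len(uri_bytes))
        + uri_bytes
        + struct.pack("<qq", offset, length)
    )


# Create two small local files that act as the external blobs.
tmp_dir = tempfile.mkdtemp()
paths = []
for name, payload in [("a.bin", b"x" * 100), ("b.bin", b"y" * 250)]:
    path = os.path.join(tmp_dir, name)
    with open(path, "wb") as f:
        f.write(payload)
    paths.append(path)

# Step 1: determine URI, offset, and length. Each descriptor covers a whole
# file, so the offset is 0 and the length is the file size.
refs = [(p, 0, os.path.getsize(p)) for p in paths]

# Steps 2-3: build and serialize one descriptor per blob.
serialized = [serialize_descriptor(uri, off, ln) for uri, off, ln in refs]

# Step 4: collect the bytes as the values of the blob column.
columns = {"id": list(range(len(serialized))), "blob": serialized}
```

With PyArrow installed, the blob list would then be wrapped in a binary-typed array and combined with the other columns into a table; the exact Arrow type depends on the table schema and is not shown here.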

This is the second step in the blob storage pipeline, following schema definition and preceding metadata writing.

Theoretical Basis

The descriptor pattern separates object metadata from object data, enabling O(1) metadata access regardless of blob size. Whether the referenced blob is 1 KB or 1 TB, the serialized descriptor occupies only 21 bytes of fixed-width fields (1 + 4 + 8 + 8) plus the UTF-8 encoded URI.

Fixed-width fields for offset and length (8 bytes each, little-endian) let a reader parse them at known positions without delimiters or escape sequences. The variable-length URI field is preceded by its length, so a reader can locate the end of the URI directly instead of scanning for a terminator.

The version byte at the start of the serialization format enables forward compatibility. If a future version of the format adds new fields, a reader can inspect the version first and then either parse the fields it knows and skip unknown trailing bytes, or raise an explicit version mismatch error instead of misreading the data.

This design resembles the flyweight pattern: many descriptors can reference different regions of the same file by varying only the offset and length, so multi-object files can be stored once and addressed piecewise without duplicating the underlying data. Each serialized descriptor still carries its own copy of the URI, as the layout above shows.
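A small sketch of this region-based addressing follows. It assumes a local container file and uses hypothetical names (Region, read_region); it only demonstrates that several (URI, offset, length) triples over one file are enough to recover each object, and it does not use any Paimon API.

```python
import os
import tempfile
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical descriptor-like triple pointing into a container file."""
    uri: str
    offset: int
    length: int


def read_region(region: Region) -> bytes:
    """Read exactly `length` bytes starting at `offset` in the file."""
    with open(region.uri, "rb") as f:
        f.seek(region.offset)
        return f.read(region.length)


# Pack three objects back to back into a single container file.
objects = [b"first-object", b"second", b"third-object-payload"]
fd, container_path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as f:
    f.write(b"".join(objects))

# One region per object: same URI, different offset and length.
regions = []
cursor = 0
for payload in objects:
    regions.append(Region(container_path, cursor, len(payload)))
    cursor += len(payload)

# Each object is recovered independently of the others.
recovered = [read_region(r) for r in regions]
os.remove(container_path)
```

Reading one object costs a single seek and a bounded read, independent of how large the container file is or how many other objects it holds.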

Related Pages

Implemented By

Implementation:Apache_Paimon_BlobDescriptor_Create_and_Serialize
