Principle:Apache Paimon Blob Descriptor Construction
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Blob_Storage |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for creating lightweight descriptors that reference external binary objects by URI, offset, and length.
Description
A BlobDescriptor encapsulates a reference to an external binary object with three key attributes:
- URI -- the location of the file (e.g., oss://bucket/path/file.mov, s3://bucket/key, or a local filesystem path)
- offset -- byte offset within the file where the relevant data begins
- length -- number of bytes to read from the offset
Descriptors are serialized into a compact binary format for efficient storage in Paimon table columns. The serialization protocol uses the following layout:
- Version (1 byte) -- protocol version for forward compatibility
- URI length (4 bytes, little-endian) -- length of the URI string in bytes
- URI bytes (variable length) -- UTF-8 encoded URI string
- Offset (8 bytes, little-endian) -- byte offset within the file
- Length (8 bytes, little-endian) -- number of bytes to read
This compact binary representation minimizes storage overhead per descriptor while retaining all information needed to locate and read the referenced blob. File size detection via file_io.get_file_size() can be used to determine the length parameter when constructing descriptors for complete files.
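The layout above can be sketched with Python's standard `struct` module. This is a minimal illustration, not the Paimon implementation: the class name `BlobDescriptorSketch`, the version value `1`, and the choice of signed integer fields are assumptions that the text does not specify.

```python
import struct
from dataclasses import dataclass

CURRENT_VERSION = 1  # assumed value; the text only says a version byte exists


@dataclass(frozen=True)
class BlobDescriptorSketch:
    """Hypothetical stand-in for a blob descriptor (not the Paimon class)."""

    uri: str
    offset: int
    length: int

    def serialize(self) -> bytes:
        uri_bytes = self.uri.encode("utf-8")
        # "<" = little-endian without padding; B = 1-byte version,
        # i = 4-byte URI length (signedness is an assumption).
        header = struct.pack("<Bi", CURRENT_VERSION, len(uri_bytes))
        # q = 8-byte integer, used for both offset and length.
        tail = struct.pack("<qq", self.offset, self.length)
        return header + uri_bytes + tail


desc = BlobDescriptorSketch("s3://bucket/key", offset=0, length=1024)
payload = desc.serialize()
# Fixed-width fields take 1 + 4 + 8 + 8 = 21 bytes; the rest is the URI.
assert len(payload) == 21 + len(desc.uri.encode("utf-8"))
```

For a complete file, `offset` would be 0 and `length` would come from a size lookup such as the `file_io.get_file_size()` call mentioned above.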
Usage
Use when preparing references to external files before writing them to a blob-enabled Paimon table. The typical workflow is:
- Determine the URI, offset, and length for each external blob
- Create a BlobDescriptor for each blob
- Serialize each descriptor using serialize()
- Store the serialized bytes in the blob column of a PyArrow table
This is the second step in the blob storage pipeline, following schema definition and preceding metadata writing.
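The four workflow steps can be sketched for whole local files as follows. Everything here is illustrative: `serialize_descriptor` and `build_blob_columns` are hypothetical helpers, the column names `name` and `blob` are made up, and `os.path.getsize` stands in for `file_io.get_file_size()`.

```python
import os
import struct
import tempfile


def serialize_descriptor(uri: str, offset: int, length: int, version: int = 1) -> bytes:
    """Hypothetical helper following the layout described in this page."""
    uri_bytes = uri.encode("utf-8")
    return (
        struct.pack("<Bi", version, len(uri_bytes))
        + uri_bytes
        + struct.pack("<qq", offset, length)
    )


def build_blob_columns(paths):
    """Run steps 1-4 for whole files; returns column name -> list of values."""
    names, blobs = [], []
    for path in paths:
        # Step 1: a whole file starts at offset 0 and spans the file size.
        length = os.path.getsize(path)  # stand-in for file_io.get_file_size()
        # Steps 2-3: build the descriptor and serialize it.
        blobs.append(serialize_descriptor(path, 0, length))
        names.append(os.path.basename(path))
    # Step 4: these lists are the values destined for the table columns.
    return {"name": names, "blob": blobs}


# Demo with a throwaway 2048-byte local file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "clip.bin")
    with open(demo_path, "wb") as f:
        f.write(b"\x00" * 2048)
    columns = build_blob_columns([demo_path])
```

With PyArrow installed, a dictionary like `columns` could be handed to `pyarrow.table(columns)` to produce the table for step 4. The exact binary column type a blob-enabled Paimon table expects is not stated here, so check it against the table schema.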
Theoretical Basis
The descriptor pattern separates object metadata from object data, so the cost of reading or storing a descriptor is independent of blob size. Whether the referenced blob is 1 KB or 1 TB, the serialized descriptor occupies only 21 bytes of fixed-width fields (1 + 4 + 8 + 8) plus the UTF-8 length of its URI.
Fixed-width fields for offset and length (8 bytes each, little-endian) can be parsed without delimiters or escape sequences. The variable-length URI field is preceded by its byte length, so a reader knows exactly how many bytes to consume and never has to scan for a terminator.
The version byte at the start of the serialization format enables forward compatibility. A future version of the descriptor format can add new fields, and an older reader can then either skip the unknown trailing bytes or raise an explicit version-mismatch error.
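The length-prefixed parsing and the version check can be sketched as below. This is one possible reader, not the Paimon one: the function name, the exception class, and the supported version value `1` are assumptions, and the sketch takes the strict option of raising on an unknown version.

```python
import struct

SUPPORTED_VERSION = 1  # assumed value


class UnsupportedVersionError(ValueError):
    """Raised when the version byte is not one this reader understands."""


def deserialize_descriptor(data: bytes):
    """Parse version, URI, offset, and length from serialized bytes."""
    # Fixed-size header: 1-byte version, then 4-byte little-endian URI length.
    version, uri_len = struct.unpack_from("<Bi", data, 0)
    if version != SUPPORTED_VERSION:
        raise UnsupportedVersionError(f"unsupported descriptor version: {version}")
    uri_start = struct.calcsize("<Bi")  # 5 bytes, no padding with "<"
    uri_end = uri_start + uri_len
    # The length prefix says exactly how many bytes belong to the URI.
    uri = data[uri_start:uri_end].decode("utf-8")
    # Two 8-byte little-endian integers follow the URI.
    offset, length = struct.unpack_from("<qq", data, uri_end)
    return uri, offset, length


raw = struct.pack("<Bi", 1, 8) + b"s3://b/k" + struct.pack("<qq", 16, 256)
uri, offset, length = deserialize_descriptor(raw)
```

Because `struct.unpack_from` reads only the bytes it needs, any trailing bytes after the length field are ignored, which is the lenient behavior described above for data that keeps the same version byte.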
This design resembles the flyweight pattern -- many descriptors can reference different regions within the same file by varying only the offset and length. Each serialized descriptor still carries its own copy of the URI, but the file data itself is never duplicated, which enables efficient storage of multi-object files.
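As a small illustration of several descriptors sharing one file, the sketch below computes the (URI, offset, length) triple for objects packed back-to-back. The helper name and the example URI are hypothetical, and contiguous packing is an assumption made only for the example.

```python
from itertools import accumulate


def regions_for_packed_file(uri, sizes):
    """Return one (uri, offset, length) triple per object in a packed file."""
    # Each object's offset is the running total of the sizes before it.
    offsets = [0, *accumulate(sizes)][:-1]
    return [(uri, off, size) for off, size in zip(offsets, sizes)]


regions = regions_for_packed_file("oss://bucket/path/pack.bin", [100, 250, 50])
```

Each triple could then be turned into its own descriptor and serialized; only the offset and length differ between them.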