Principle:Apache Paimon Blob Descriptor Construction

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Blob_Storage
Last Updated 2026-02-07 00:00 GMT

Overview

Mechanism for creating lightweight descriptors that reference external binary objects by URI, offset, and length.

Description

A BlobDescriptor encapsulates a reference to an external binary object with three key attributes:

  • URI -- the location of the file (e.g., oss://bucket/path/file.mov, s3://bucket/key, or a local filesystem path)
  • offset -- byte offset within the file where the relevant data begins
  • length -- number of bytes to read from the offset

Descriptors are serialized into a compact binary format for efficient storage in Paimon table columns. The serialization protocol uses the following layout:

  1. Version (1 byte) -- protocol version for forward compatibility
  2. URI length (4 bytes, little-endian) -- length of the URI string in bytes
  3. URI bytes (variable length) -- UTF-8 encoded URI string
  4. Offset (8 bytes, little-endian) -- byte offset within the file
  5. Length (8 bytes, little-endian) -- number of bytes to read
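The five-field layout above can be sketched with Python's standard struct module. This is a minimal illustration, not the library's own serializer: the function name pack_descriptor, the version value 1, and the choice of signed 64-bit integers for offset and length are assumptions, since the layout specifies only field widths and byte order.

```python
import struct

# Assumed value: the layout above does not state the current version number.
DESCRIPTOR_VERSION = 1


def pack_descriptor(uri: str, offset: int, length: int) -> bytes:
    """Pack a descriptor following the five-field layout (hypothetical helper)."""
    uri_bytes = uri.encode("utf-8")
    return (
        struct.pack("<B", DESCRIPTOR_VERSION)  # 1. version, 1 byte
        + struct.pack("<I", len(uri_bytes))    # 2. URI length, 4 bytes, little-endian
        + uri_bytes                            # 3. URI bytes, UTF-8
        + struct.pack("<q", offset)            # 4. offset, 8 bytes (signedness assumed)
        + struct.pack("<q", length)            # 5. length, 8 bytes (signedness assumed)
    )


payload = pack_descriptor("s3://bucket/key", 0, 1024)
print(len(payload))  # 1 + 4 + 15 + 8 + 8 = 36
```

The fixed-width fields contribute 21 bytes, so the serialized size is always 21 plus the byte length of the UTF-8 URI.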

This compact binary representation minimizes storage overhead per descriptor while retaining all information needed to locate and read the referenced blob. For a descriptor that covers a complete file, the length parameter can be obtained from file_io.get_file_size().

Usage

Use when preparing references to external files before writing them to a blob-enabled Paimon table. The typical workflow is:

  1. Determine the URI, offset, and length for each external blob
  2. Create a BlobDescriptor for each blob
  3. Serialize each descriptor using serialize()
  4. Store the serialized bytes in the blob column of a PyArrow table
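The four steps can be walked through with the standard library alone. The sketch below is illustrative and makes several assumptions: pack_descriptor is a hypothetical stand-in for BlobDescriptor plus serialize(), os.path.getsize stands in for file_io.get_file_size() on local files, and the version value 1 and signed 64-bit fields are guesses.

```python
import os
import struct
import tempfile


def pack_descriptor(uri, offset, length, version=1):
    """Hypothetical stand-in for BlobDescriptor(...).serialize()."""
    uri_bytes = uri.encode("utf-8")
    header = struct.pack("<BI", version, len(uri_bytes))  # version + URI length
    return header + uri_bytes + struct.pack("<qq", offset, length)


with tempfile.TemporaryDirectory() as tmp:
    # Create two small local files to act as external blobs.
    paths = []
    for name, data in [("a.bin", b"x" * 10), ("b.bin", b"y" * 25)]:
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)

    # Step 1: whole-file references use offset 0 and the file size as length.
    refs = [(p, 0, os.path.getsize(p)) for p in paths]

    # Steps 2 and 3: build and serialize one descriptor per blob.
    blob_column = [pack_descriptor(uri, off, ln) for uri, off, ln in refs]

# Step 4: blob_column holds one bytes value per row for the blob column.
lengths = [ref[2] for ref in refs]
print(lengths)
```

With PyArrow installed, a list like blob_column could be wrapped with pa.array(blob_column, type=pa.large_binary()); the exact column type that a blob-enabled table expects is not stated here and should be checked against the table schema.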

This is the second step in the blob storage pipeline, following schema definition and preceding metadata writing.

Theoretical Basis

The descriptor pattern separates object metadata from object data, enabling O(1) metadata access regardless of blob size. Whether the referenced blob is 1 KB or 1 TB, the descriptor itself occupies only 21 bytes of fixed-width fields (version, URI length, offset, length) plus the UTF-8 encoded URI.

Using fixed-width fields for offset and length (8 bytes each, little-endian) ensures consistent parsing without delimiters or escape sequences. The variable-length URI field is preceded by its length, so a reader can slice the URI out directly instead of scanning for a terminator.

The version byte at the start of the serialization format enables forward compatibility. Future versions of the descriptor format can add new fields after the existing ones. An older reader that encounters a newer version can then either parse the fields it knows and skip the unknown trailing bytes, or raise an explicit version mismatch error.
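A reader that takes the stricter of these two options might look like the sketch below. The name unpack_descriptor, the supported version value 1, and the signed 64-bit interpretation of offset and length are assumptions made for illustration.

```python
import struct

# Assumed value: the only version this hypothetical reader understands.
SUPPORTED_VERSION = 1


def unpack_descriptor(data: bytes):
    """Parse (uri, offset, length), rejecting unknown versions."""
    version = data[0]
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported descriptor version: {version}")
    (uri_len,) = struct.unpack_from("<I", data, 1)  # 4-byte URI length
    uri_start = 5                                   # 1 version byte + 4 length bytes
    uri = data[uri_start:uri_start + uri_len].decode("utf-8")
    # Offset and length follow the URI; bytes after them are ignored.
    offset, length = struct.unpack_from("<qq", data, uri_start + uri_len)
    return uri, offset, length


sample = struct.pack("<BI", 1, 10) + b"/tmp/a.bin" + struct.pack("<qq", 128, 4096)
print(unpack_descriptor(sample))
```

Because struct.unpack_from reads only the bytes it needs, this reader also tolerates extra trailing bytes within a version it accepts.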

This design resembles the flyweight pattern: many descriptors can reference different regions of the same file by varying the offset and length. Multiple objects can therefore be packed into one file and addressed individually without duplicating the underlying data, although each descriptor still carries its own copy of the URI.
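The idea of several references into one file can be demonstrated with plain tuples and a local file. This is a sketch under stated assumptions: the tuples stand in for descriptors, read_region is a hypothetical helper, and a local path stands in for the URI.

```python
import os
import tempfile

objects = [b"first object", b"second", b"third object here"]


def read_region(uri, offset, length):
    """Read exactly the bytes that one (uri, offset, length) reference covers."""
    with open(uri, "rb") as f:
        f.seek(offset)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "packed.bin")
    refs = []
    with open(path, "wb") as f:
        for obj in objects:
            # Record where this object starts before appending it.
            refs.append((path, f.tell(), len(obj)))
            f.write(obj)

    # Each reference recovers its own object from the shared file.
    recovered = [read_region(*ref) for ref in refs]

offsets = [ref[1] for ref in refs]
print(offsets)
```

All three references share the same path and differ only in offset and length, which is what allows one physical file to hold many logical blobs.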
