Workflow: Apache Paimon Blob Storage With Descriptors
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Unstructured_Data, Data_Engineering |
| Last Updated | 2026-02-07 23:00 GMT |
Overview
End-to-end process for storing large binary objects (videos, images, models) in Paimon tables using blob-as-descriptor mode, where file references (path, offset, length) replace inline binary data for efficient metadata management.
Description
This workflow enables Paimon tables to reference large binary objects stored on external storage (OSS, S3, HDFS) without embedding the raw binary data in the table's data files. Instead of storing full binary payloads, the table stores serialized BlobDescriptor objects containing the storage path, byte offset, and length of each blob. This approach keeps table metadata compact, avoids data duplication, and enables lazy loading of binary content. The workflow supports row tracking and data evolution for progressive updates to blob-associated metadata.
Usage
Execute this workflow when you need to manage a catalog of large binary assets (videos, images, ML model weights, audio files) alongside structured metadata. This is the recommended approach when binary objects are already stored on cloud object storage and you want to track them in a Paimon table without copying the data.
Execution Steps
Step 1: Blob Table Schema Definition
Define a table schema that includes a large_binary column designated as the blob field. Enable blob-as-descriptor mode via table options, which instructs the writer to treat the blob column as containing serialized descriptors rather than raw binary data. Enable row tracking and data evolution for change management.
Key considerations:
- The blob column type must be large_binary
- Set blob-field option to the name of the binary column
- Set blob-as-descriptor to true to enable descriptor mode
- Enable row-tracking and data-evolution for update support
- Other columns store structured metadata alongside blob references
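The option names below are taken from the considerations above; as a minimal sketch, the exact option keys (and whether row tracking and data evolution use these spellings) should be checked against your Paimon version's documentation before use:

```python
# Hypothetical table-option map for a blob-as-descriptor table.
# Key spellings are assumptions drawn from this workflow's text,
# not verified against a specific Paimon release.
blob_table_options = {
    "blob-field": "content",         # name of the large_binary blob column
    "blob-as-descriptor": "true",    # store serialized descriptors, not raw bytes
    "row-tracking.enabled": "true",  # assumed key for row tracking
    "data-evolution.enabled": "true" # assumed key for data evolution
}

# The remaining columns (id, name, tags, ...) hold structured metadata
# alongside the blob references.
```

These options would typically be passed when creating the table through the catalog API or a CREATE TABLE statement.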
Step 2: Blob Descriptor Construction
For each binary object, obtain its storage location (path), byte offset within the file, and total length in bytes. Construct a BlobDescriptor object from these values and serialize it to bytes. The serialized descriptor replaces the raw binary data in the table write.
Key considerations:
- Use FileIO to query file sizes for offset and length calculation
- BlobDescriptors are lightweight metadata (path, offset, length)
- Serialization produces a compact byte representation
- Multiple blobs can reference different regions of the same file
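A construction-and-serialization round trip can be sketched as follows. The `BlobDescriptor` class and its byte layout here are illustrative stand-ins, not Paimon's actual serialization format:

```python
import struct
from dataclasses import dataclass

@dataclass(frozen=True)
class BlobDescriptor:
    """Illustrative stand-in for Paimon's BlobDescriptor:
    a lightweight (path, offset, length) triple."""
    path: str
    offset: int
    length: int

    def to_bytes(self) -> bytes:
        # Hypothetical wire format (NOT Paimon's real encoding):
        # 4-byte path length, UTF-8 path, two 8-byte big-endian ints.
        p = self.path.encode("utf-8")
        return struct.pack(">I", len(p)) + p + struct.pack(">qq", self.offset, self.length)

    @classmethod
    def from_bytes(cls, data: bytes) -> "BlobDescriptor":
        (n,) = struct.unpack_from(">I", data, 0)
        path = data[4:4 + n].decode("utf-8")
        offset, length = struct.unpack_from(">qq", data, 4 + n)
        return cls(path, offset, length)

# Two descriptors can reference different regions of the same file.
intro = BlobDescriptor("oss://bucket/media/pack-0.bin", offset=0, length=1_048_576)
outro = BlobDescriptor("oss://bucket/media/pack-0.bin", offset=1_048_576, length=524_288)
assert BlobDescriptor.from_bytes(intro.to_bytes()) == intro
```

In practice the offset and length would come from `FileIO` file-size queries rather than being hard-coded.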
Step 3: Metadata and Descriptor Writing
Combine the structured metadata columns (text, names, tags) with the serialized blob-descriptor column into a pandas DataFrame or PyArrow Table. Write the combined data to the Paimon table using the standard batch write builder and commit the transaction.
Key considerations:
- The blob column contains serialized BlobDescriptor bytes, not raw binary data
- Standard write and commit operations apply
- Table size remains proportional to metadata volume, not blob volume
- Multiple rows can be written in a single batch
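Assembling a batch can be sketched as below, assuming the descriptors were already serialized in Step 2. The write-builder calls appear only as comments because the exact pypaimon method names vary by release and are not confirmed by this workflow:

```python
# Rows combine structured metadata with opaque serialized-descriptor bytes.
# The "content" values are placeholders for real BlobDescriptor bytes.
rows = [
    {"id": 1, "name": "intro.mp4", "tags": "video,short",
     "content": b"<serialized BlobDescriptor bytes>"},
    {"id": 2, "name": "weights.bin", "tags": "ml,model",
     "content": b"<serialized BlobDescriptor bytes>"},
]

# Hypothetical write path (method names are assumptions):
# import pandas as pd
# write_builder = table.new_batch_write_builder()
# table_write = write_builder.new_write()
# table_write.write_pandas(pd.DataFrame(rows))   # or an Arrow table
# write_builder.new_commit().commit(table_write.prepare_commit())
```

Because the blob column holds only descriptor bytes, the committed data files stay proportional to the metadata volume, not the referenced blob volume.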
Step 4: Reading and Descriptor Deserialization
Read the table using the standard scan-and-read pipeline. The blob column in the result contains serialized BlobDescriptor bytes. Deserialize each descriptor to recover the storage path, offset, and length of the referenced binary object.
Key considerations:
- The read path returns serialized descriptors, not raw binary data
- Deserialization reconstructs the BlobDescriptor object
- Reads are fast because blob data is not transferred
- Predicate pushdown on metadata columns works normally
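Deserialization on the read side can be sketched as the inverse of the illustrative Step 2 format; again, this byte layout is an assumption for demonstration, not Paimon's actual encoding, and the scan calls in the comments are assumed method names:

```python
import struct

def deserialize_descriptor(data: bytes):
    """Inverse of the illustrative Step 2 format: recover (path, offset, length)."""
    (n,) = struct.unpack_from(">I", data, 0)
    path = data[4:4 + n].decode("utf-8")
    offset, length = struct.unpack_from(">qq", data, 4 + n)
    return path, offset, length

# The scan itself would use the standard pypaimon read pipeline, e.g.
# (hypothetical names):
# read_builder = table.new_read_builder()
# splits = read_builder.new_scan().plan().splits()
# df = read_builder.new_read().to_pandas(splits)
# for blob_bytes in df["content"]:
#     path, offset, length = deserialize_descriptor(blob_bytes)

p = "oss://bucket/media/pack-0.bin".encode("utf-8")
sample = struct.pack(">I", len(p)) + p + struct.pack(">qq", 0, 1_048_576)
path, offset, length = deserialize_descriptor(sample)
```

Note that only the descriptor bytes cross the read path; no blob content is transferred until Step 5.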
Step 5: Lazy Blob Data Loading
For each deserialized BlobDescriptor, create a Blob object that provides lazy access to the underlying binary data. Use the FileUriReader with the table's FileIO configuration to read the binary content on demand. This enables selective loading of only the blobs needed for a given operation.
Key considerations:
- Blob data is loaded only when explicitly requested
- FileUriReader handles authentication and storage protocol details
- Data is read from the original storage location (no copies)
- Large blobs can be read in chunks for memory efficiency
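The lazy-access pattern can be sketched with a plain range read. A real implementation would go through the table's FileIO and FileUriReader so that cloud schemes (oss://, s3://) and authentication are handled; the `open()`-based reader and `LazyBlob` class here are illustrative only:

```python
import os
import tempfile

class LazyBlob:
    """Sketch of lazy blob access: hold (path, offset, length) from a
    deserialized descriptor and read bytes only on demand."""

    def __init__(self, path: str, offset: int, length: int):
        self.path, self.offset, self.length = path, offset, length

    def read(self) -> bytes:
        # Read exactly the referenced region; nothing is loaded until called.
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            return f.read(self.length)

    def chunks(self, chunk_size: int = 1 << 20):
        # Stream large blobs in pieces for memory efficiency.
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            remaining = self.length
            while remaining > 0:
                piece = f.read(min(chunk_size, remaining))
                if not piece:
                    break
                remaining -= len(piece)
                yield piece

# Demo against a local file standing in for object storage.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"headerPAYLOADtrailer")
blob = LazyBlob(tmp.name, offset=6, length=7)
data = blob.read()   # only these 7 bytes are read from storage
os.unlink(tmp.name)
```

The same region-read idea applies unchanged when the path is an object-store URI resolved through FileUriReader instead of a local file.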