Workflow: Apache Paimon Blob Storage With Descriptors
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Unstructured_Data, Data_Engineering |
| Last Updated | 2026-02-07 23:00 GMT |
Overview
End-to-end process for storing large binary objects (videos, images, models) in Paimon tables using blob-as-descriptor mode, where file references (path, offset, length) replace inline binary data for efficient metadata management.
Description
This workflow enables Paimon tables to reference large binary objects stored on external storage (OSS, S3, HDFS) without embedding the raw binary data in the table's data files. Instead of storing full binary payloads, the table stores serialized BlobDescriptor objects containing the storage path, byte offset, and length of each blob. This approach keeps table metadata compact, avoids data duplication, and enables lazy loading of binary content. The workflow supports row tracking and data evolution for progressive updates to blob-associated metadata.
Usage
Execute this workflow when you need to manage a catalog of large binary assets (videos, images, ML model weights, audio files) alongside structured metadata. This is the recommended approach when binary objects are already stored on cloud object storage and you want to track them in a Paimon table without copying the data.
Execution Steps
Step 1: Blob Table Schema Definition
Define a table schema that includes a large_binary column designated as the blob field. Enable blob-as-descriptor mode via table options, which instructs the writer to treat the blob column as containing serialized descriptors rather than raw binary data. Enable row tracking and data evolution for change management.
Key considerations:
- The blob column type must be large_binary
- Set blob-field option to the name of the binary column
- Set blob-as-descriptor to true to enable descriptor mode
- Enable row-tracking and data-evolution for update support
- Other columns store structured metadata alongside blob references
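The option names below are taken from the considerations above; as a minimal sketch, the exact option keys (and whether row tracking and data evolution use these spellings) should be checked against your Paimon version's documentation before use:

```python
# Hypothetical table-option map for a blob-as-descriptor table.
# Key spellings are assumptions drawn from this workflow's text,
# not verified against a specific Paimon release.
blob_table_options = {
    "blob-field": "content",         # name of the large_binary blob column
    "blob-as-descriptor": "true",    # store serialized descriptors, not raw bytes
    "row-tracking.enabled": "true",  # assumed key for row tracking
    "data-evolution.enabled": "true" # assumed key for data evolution
}

# The remaining columns (id, name, tags, ...) hold structured metadata
# alongside the blob references.
```

These options would typically be passed when creating the table through the catalog API or a CREATE TABLE statement.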
Step 2: Blob Descriptor Construction
For each binary object, obtain its storage location (path), byte offset within the file, and total length in bytes. Construct a BlobDescriptor object from these values and serialize it to bytes. The serialized descriptor replaces the raw binary data in the table write.
Key considerations:
- Use FileIO to query file sizes for offset and length calculation
- BlobDescriptors are lightweight metadata (path, offset, length)
- Serialization produces a compact byte representation
- Multiple blobs can reference different regions of the same file
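A construction-and-serialization round trip can be sketched as follows. The `BlobDescriptor` class and its byte layout here are illustrative stand-ins, not Paimon's actual serialization format:

```python
import struct
from dataclasses import dataclass

@dataclass(frozen=True)
class BlobDescriptor:
    """Illustrative stand-in for Paimon's BlobDescriptor:
    a lightweight (path, offset, length) triple."""
    path: str
    offset: int
    length: int

    def to_bytes(self) -> bytes:
        # Hypothetical wire format (NOT Paimon's real encoding):
        # 4-byte path length, UTF-8 path, two 8-byte big-endian ints.
        p = self.path.encode("utf-8")
        return struct.pack(">I", len(p)) + p + struct.pack(">qq", self.offset, self.length)

    @classmethod
    def from_bytes(cls, data: bytes) -> "BlobDescriptor":
        (n,) = struct.unpack_from(">I", data, 0)
        path = data[4:4 + n].decode("utf-8")
        offset, length = struct.unpack_from(">qq", data, 4 + n)
        return cls(path, offset, length)

# Two descriptors can reference different regions of the same file.
intro = BlobDescriptor("oss://bucket/media/pack-0.bin", offset=0, length=1_048_576)
outro = BlobDescriptor("oss://bucket/media/pack-0.bin", offset=1_048_576, length=524_288)
assert BlobDescriptor.from_bytes(intro.to_bytes()) == intro
```

In practice the offset and length would come from `FileIO` file-size queries rather than being hard-coded.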
Step 3: Metadata and Descriptor Writing
Combine the structured metadata columns (text, names, tags) with the serialized blob-descriptor column into a pandas DataFrame or PyArrow Table. Write the combined data to the Paimon table using the standard batch write builder and commit the transaction.
Key considerations:
- The blob column contains serialized BlobDescriptor bytes, not raw binary data
- Standard write and commit operations apply
- Table size remains proportional to metadata volume, not blob volume
- Multiple rows can be written in a single batch
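Assembling a batch can be sketched as below, assuming the descriptors were already serialized in Step 2. The write-builder calls appear only as comments because the exact pypaimon method names vary by release and are not confirmed by this workflow:

```python
# Rows combine structured metadata with opaque serialized-descriptor bytes.
# The "content" values are placeholders for real BlobDescriptor bytes.
rows = [
    {"id": 1, "name": "intro.mp4", "tags": "video,short",
     "content": b"<serialized BlobDescriptor bytes>"},
    {"id": 2, "name": "weights.bin", "tags": "ml,model",
     "content": b"<serialized BlobDescriptor bytes>"},
]

# Hypothetical write path (method names are assumptions):
# import pandas as pd
# write_builder = table.new_batch_write_builder()
# table_write = write_builder.new_write()
# table_write.write_pandas(pd.DataFrame(rows))   # or an Arrow table
# write_builder.new_commit().commit(table_write.prepare_commit())
```

Because the blob column holds only descriptor bytes, the committed data files stay proportional to the metadata volume, not the referenced blob volume.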
Step 4: Reading and Descriptor Deserialization
Read the table using the standard scan-and-read pipeline. The blob column in the result contains serialized BlobDescriptor bytes. Deserialize each descriptor to recover the storage path, offset, and length of the referenced binary object.
Key considerations:
- The read path returns serialized descriptors, not raw binary data
- Deserialization reconstructs the BlobDescriptor object
- Reads are fast because blob data is not transferred
- Predicate pushdown on metadata columns works normally
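Deserialization on the read side can be sketched as the inverse of the illustrative Step 2 format; again, this byte layout is an assumption for demonstration, not Paimon's actual encoding, and the scan calls in the comments are assumed method names:

```python
import struct

def deserialize_descriptor(data: bytes):
    """Inverse of the illustrative Step 2 format: recover (path, offset, length)."""
    (n,) = struct.unpack_from(">I", data, 0)
    path = data[4:4 + n].decode("utf-8")
    offset, length = struct.unpack_from(">qq", data, 4 + n)
    return path, offset, length

# The scan itself would use the standard pypaimon read pipeline, e.g.
# (hypothetical names):
# read_builder = table.new_read_builder()
# splits = read_builder.new_scan().plan().splits()
# df = read_builder.new_read().to_pandas(splits)
# for blob_bytes in df["content"]:
#     path, offset, length = deserialize_descriptor(blob_bytes)

p = "oss://bucket/media/pack-0.bin".encode("utf-8")
sample = struct.pack(">I", len(p)) + p + struct.pack(">qq", 0, 1_048_576)
path, offset, length = deserialize_descriptor(sample)
```

Note that only the descriptor bytes cross the read path; no blob content is transferred until Step 5.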
Step 5: Lazy Blob Data Loading
For each deserialized BlobDescriptor, create a Blob object that provides lazy access to the underlying binary data. Use the FileUriReader with the table's FileIO configuration to read the binary content on demand. This enables selective loading of only the blobs needed for a given operation.
Key considerations:
- Blob data is loaded only when explicitly requested
- FileUriReader handles authentication and storage protocol details
- Data is read from the original storage location (no copies)
- Large blobs can be read in chunks for memory efficiency
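The lazy-access pattern can be sketched with a plain range read. A real implementation would go through the table's FileIO and FileUriReader so that cloud schemes (oss://, s3://) and authentication are handled; the `open()`-based reader and `LazyBlob` class here are illustrative only:

```python
import os
import tempfile

class LazyBlob:
    """Sketch of lazy blob access: hold (path, offset, length) from a
    deserialized descriptor and read bytes only on demand."""

    def __init__(self, path: str, offset: int, length: int):
        self.path, self.offset, self.length = path, offset, length

    def read(self) -> bytes:
        # Read exactly the referenced region; nothing is loaded until called.
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            return f.read(self.length)

    def chunks(self, chunk_size: int = 1 << 20):
        # Stream large blobs in pieces for memory efficiency.
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            remaining = self.length
            while remaining > 0:
                piece = f.read(min(chunk_size, remaining))
                if not piece:
                    break
                remaining -= len(piece)
                yield piece

# Demo against a local file standing in for object storage.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"headerPAYLOADtrailer")
blob = LazyBlob(tmp.name, offset=6, length=7)
data = blob.read()   # only these 7 bytes are read from storage
os.unlink(tmp.name)
```

The same region-read idea applies unchanged when the path is an object-store URI resolved through FileUriReader instead of a local file.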