Implementation: Apache Paimon Blob From Descriptor
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Blob_Storage |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for creating lazy blob references and loading binary data on demand from external storage.
Description
Blob.from_descriptor() creates a BlobRef that wraps a UriReader and BlobDescriptor. The BlobRef does not perform any I/O at construction time -- it only stores the references needed for later data retrieval.
The key components are:
- Blob -- abstract base class defining the to_data() and new_input_stream() interface
- BlobRef -- concrete implementation that lazily loads data using a UriReader and BlobDescriptor
- FileUriReader -- wraps PyArrow FileIO for filesystem and cloud storage access (oss://, s3://, hdfs://, local paths)
- HttpUriReader -- handles http:// and https:// URIs via the Python requests library
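The choice of reader follows the URI scheme. A hypothetical helper (`pick_reader_kind` is illustrative, not a Paimon API) sketches the routing described above:

```python
from urllib.parse import urlparse

def pick_reader_kind(uri: str) -> str:
    # http(s) URIs would go to HttpUriReader (backed by `requests`);
    # everything else (oss://, s3://, hdfs://, bare local paths) would go
    # through FileUriReader, which wraps PyArrow FileIO
    scheme = urlparse(uri).scheme
    return 'http' if scheme in ('http', 'https') else 'file'
```

An empty scheme (a plain local path) falls through to the file reader, matching the behavior described for FileUriReader.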
When BlobRef.to_data() is called, it:
- Opens an input stream via the UriReader using the descriptor's URI
- Seeks to the descriptor's offset position
- Reads exactly the descriptor's length bytes
- Returns the raw bytes
When BlobRef.new_input_stream() is called, it returns a BytesIO stream positioned at the start of the blob data, enabling streaming access for large files.
Usage
Use this after deserializing BlobDescriptor objects from a table read, when the actual blob content is needed. The FileUriReader is constructed from the table's file_io object, which provides access to the configured storage backend.
Code Reference
Source Location
- Repository: Apache Paimon
- Files:
  - paimon-python/pypaimon/table/row/blob.py:L130-251
  - paimon-python/pypaimon/common/uri_reader.py:L32-66
Signature
```python
class Blob(ABC):
    @staticmethod
    def from_descriptor(uri_reader: UriReader, descriptor: BlobDescriptor) -> 'Blob': ...

    @abstractmethod
    def to_data(self) -> bytes: ...

    @abstractmethod
    def new_input_stream(self) -> io.BytesIO: ...


class FileUriReader(UriReader):
    def __init__(self, file_io: Any): ...

    def new_input_stream(self, uri: str): ...


class BlobRef(Blob):
    def to_data(self) -> bytes: ...

    def new_input_stream(self) -> io.BytesIO: ...
```
Import
```python
from pypaimon.table.row.blob import Blob, BlobDescriptor
from pypaimon.common.uri_reader import FileUriReader
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| uri_reader | UriReader | Yes | A FileUriReader (for filesystem/cloud storage) or HttpUriReader (for HTTP/HTTPS URLs) that provides access to the storage backend |
| descriptor | BlobDescriptor | Yes | A deserialized BlobDescriptor containing the URI, offset, and length of the external blob |
Outputs
| Name | Type | Description |
|---|---|---|
| Blob | Blob (BlobRef) | A lazy blob reference that defers I/O until to_data() or new_input_stream() is called |
| to_data() | bytes | The actual blob content, read from the external storage location specified by the descriptor |
| new_input_stream() | io.BytesIO | A BytesIO stream providing streaming access to the blob content |
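The "defers I/O" guarantee in the table above can be demonstrated with a counting reader (`CountingReader` and `LazyBlob` are illustrative sketches, not Paimon classes):

```python
import io

class CountingReader:
    """Toy UriReader that counts opens, to make deferred I/O observable."""
    def __init__(self, payload: bytes):
        self.payload = payload
        self.opens = 0

    def new_input_stream(self, uri: str):
        self.opens += 1
        return io.BytesIO(self.payload)

class LazyBlob:
    """Minimal BlobRef-style wrapper following the same lazy contract."""
    def __init__(self, reader, uri: str, offset: int, length: int):
        self.reader, self.uri = reader, uri
        self.offset, self.length = offset, length

    def to_data(self) -> bytes:
        stream = self.reader.new_input_stream(self.uri)
        stream.seek(self.offset)
        return stream.read(self.length)

reader = CountingReader(b'0123456789')
blob = LazyBlob(reader, 'mem://demo', 2, 4)
assert reader.opens == 0          # constructing the blob performed no I/O
assert blob.to_data() == b'2345'  # offset 2, length 4
assert reader.opens == 1          # the single read happened on demand
```

This mirrors why from_descriptor() is cheap to call for every row of a table read: storage is only touched for the blobs that are actually materialized.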
Usage Examples
Basic Usage
```python
from pypaimon.table.row.blob import Blob, BlobDescriptor
from pypaimon.common.uri_reader import FileUriReader

# Reconstruct from stored descriptor bytes
descriptor = BlobDescriptor.deserialize(stored_bytes)

# Create a lazy blob reference using the table's file_io
uri_reader = FileUriReader(table.file_io)
blob = Blob.from_descriptor(uri_reader, descriptor)

# Load data on demand -- no I/O occurs until this call
data = blob.to_data()
print(f"Loaded {len(data)} bytes from {descriptor.uri}")
```
Streaming Access for Large Files
```python
from pypaimon.table.row.blob import Blob, BlobDescriptor
from pypaimon.common.uri_reader import FileUriReader

descriptor = BlobDescriptor.deserialize(stored_bytes)
uri_reader = FileUriReader(table.file_io)
blob = Blob.from_descriptor(uri_reader, descriptor)

# Stream large blob content in chunks
with blob.new_input_stream() as stream:
    while True:
        chunk = stream.read(4096)
        if not chunk:
            break
        process_chunk(chunk)
```
Selective Loading from Table Read
```python
from pypaimon.table.row.blob import Blob, BlobDescriptor
from pypaimon.common.uri_reader import FileUriReader

# Read the table metadata
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
splits = scan.plan().splits()
reader = read_builder.new_read()
arrow_table = reader.to_arrow(splits)

uri_reader = FileUriReader(table.file_io)

# Selectively load only video blobs larger than 1 MB
for i in range(len(arrow_table)):
    content_type = arrow_table.column('content_type')[i].as_py()
    if content_type.startswith('video/'):
        descriptor = BlobDescriptor.deserialize(
            arrow_table.column('data')[i].as_py()
        )
        if descriptor.length > 1048576:
            blob = Blob.from_descriptor(uri_reader, descriptor)
            video_data = blob.to_data()
            process_video(video_data)
```