Implementation: Apache Paimon Blob From Descriptor
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Blob_Storage |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for creating lazy blob references and loading binary data on demand from external storage.
Description
Blob.from_descriptor() creates a BlobRef that wraps a UriReader and BlobDescriptor. The BlobRef does not perform any I/O at construction time -- it only stores the references needed for later data retrieval.
The key components are:
- Blob -- abstract base class defining the to_data() and new_input_stream() interface
- BlobRef -- concrete implementation that lazily loads data using a UriReader and BlobDescriptor
- FileUriReader -- wraps PyArrow FileIO for filesystem and cloud storage access (oss://, s3://, hdfs://, local paths)
- HttpUriReader -- handles http:// and https:// URIs via the Python requests library
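The choice of reader follows the URI scheme. A hypothetical helper (`pick_reader_kind` is illustrative, not a Paimon API) sketches the routing described above:

```python
from urllib.parse import urlparse

def pick_reader_kind(uri: str) -> str:
    # http(s) URIs would go to HttpUriReader (backed by `requests`);
    # everything else (oss://, s3://, hdfs://, bare local paths) would go
    # through FileUriReader, which wraps PyArrow FileIO
    scheme = urlparse(uri).scheme
    return 'http' if scheme in ('http', 'https') else 'file'
```

An empty scheme (a plain local path) falls through to the file reader, matching the behavior described for FileUriReader.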
When BlobRef.to_data() is called, it:
- Opens an input stream via the UriReader using the descriptor's URI
- Seeks to the descriptor's offset position
- Reads exactly the descriptor's length bytes
- Returns the raw bytes
When BlobRef.new_input_stream() is called, it returns a BytesIO stream positioned at the start of the blob data, enabling streaming access for large files.
Usage
Use this after deserializing BlobDescriptor objects from a table read, when the actual blob content is needed. The FileUriReader is constructed from the table's file_io object, which provides access to the configured storage backend.
Code Reference
Source Location
- Repository: Apache Paimon
- Files:
  - paimon-python/pypaimon/table/row/blob.py:L130-251
  - paimon-python/pypaimon/common/uri_reader.py:L32-66
Signature
```python
class Blob(ABC):
    @staticmethod
    def from_descriptor(uri_reader: UriReader, descriptor: BlobDescriptor) -> 'Blob': ...

    @abstractmethod
    def to_data(self) -> bytes: ...

    @abstractmethod
    def new_input_stream(self) -> io.BytesIO: ...


class FileUriReader(UriReader):
    def __init__(self, file_io: Any): ...

    def new_input_stream(self, uri: str): ...


class BlobRef(Blob):
    def to_data(self) -> bytes: ...

    def new_input_stream(self) -> io.BytesIO: ...
```
Import
```python
from pypaimon.table.row.blob import Blob, BlobDescriptor
from pypaimon.common.uri_reader import FileUriReader
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| uri_reader | UriReader | Yes | A FileUriReader (for filesystem/cloud storage) or HttpUriReader (for HTTP/HTTPS URLs) that provides access to the storage backend |
| descriptor | BlobDescriptor | Yes | A deserialized BlobDescriptor containing the URI, offset, and length of the external blob |
Outputs
| Name | Type | Description |
|---|---|---|
| Blob | Blob (BlobRef) | A lazy blob reference that defers I/O until to_data() or new_input_stream() is called |
| to_data() | bytes | The actual blob content, read from the external storage location specified by the descriptor |
| new_input_stream() | io.BytesIO | A BytesIO stream providing streaming access to the blob content |
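The "defers I/O" guarantee in the table above can be demonstrated with a counting reader (`CountingReader` and `LazyBlob` are illustrative sketches, not Paimon classes):

```python
import io

class CountingReader:
    """Toy UriReader that counts opens, to make deferred I/O observable."""
    def __init__(self, payload: bytes):
        self.payload = payload
        self.opens = 0

    def new_input_stream(self, uri: str):
        self.opens += 1
        return io.BytesIO(self.payload)

class LazyBlob:
    """Minimal BlobRef-style wrapper following the same lazy contract."""
    def __init__(self, reader, uri: str, offset: int, length: int):
        self.reader, self.uri = reader, uri
        self.offset, self.length = offset, length

    def to_data(self) -> bytes:
        stream = self.reader.new_input_stream(self.uri)
        stream.seek(self.offset)
        return stream.read(self.length)

reader = CountingReader(b'0123456789')
blob = LazyBlob(reader, 'mem://demo', 2, 4)
assert reader.opens == 0          # constructing the blob performed no I/O
assert blob.to_data() == b'2345'  # offset 2, length 4
assert reader.opens == 1          # the single read happened on demand
```

This mirrors why from_descriptor() is cheap to call for every row of a table read: storage is only touched for the blobs that are actually materialized.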
Usage Examples
Basic Usage
```python
from pypaimon.table.row.blob import Blob, BlobDescriptor
from pypaimon.common.uri_reader import FileUriReader

# Reconstruct from stored descriptor bytes
descriptor = BlobDescriptor.deserialize(stored_bytes)

# Create a lazy blob reference using the table's file_io
uri_reader = FileUriReader(table.file_io)
blob = Blob.from_descriptor(uri_reader, descriptor)

# Load data on demand -- no I/O occurs until this call
data = blob.to_data()
print(f"Loaded {len(data)} bytes from {descriptor.uri}")
```
Streaming Access for Large Files
```python
from pypaimon.table.row.blob import Blob, BlobDescriptor
from pypaimon.common.uri_reader import FileUriReader

descriptor = BlobDescriptor.deserialize(stored_bytes)
uri_reader = FileUriReader(table.file_io)
blob = Blob.from_descriptor(uri_reader, descriptor)

# Stream large blob content in chunks
with blob.new_input_stream() as stream:
    while True:
        chunk = stream.read(4096)
        if not chunk:
            break
        process_chunk(chunk)
```
Selective Loading from Table Read
```python
from pypaimon.table.row.blob import Blob, BlobDescriptor
from pypaimon.common.uri_reader import FileUriReader

# Read the table metadata
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
splits = scan.plan().splits()
reader = read_builder.new_read()
arrow_table = reader.to_arrow(splits)

uri_reader = FileUriReader(table.file_io)

# Selectively load only video blobs larger than 1 MB
for i in range(len(arrow_table)):
    content_type = arrow_table.column('content_type')[i].as_py()
    if content_type.startswith('video/'):
        descriptor = BlobDescriptor.deserialize(
            arrow_table.column('data')[i].as_py()
        )
        if descriptor.length > 1048576:
            blob = Blob.from_descriptor(uri_reader, descriptor)
            video_data = blob.to_data()
            process_video(video_data)
```