
Implementation:Apache Paimon Blob From Descriptor

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Blob_Storage
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool for creating lazy blob references and loading binary data on demand from external storage.

Description

Blob.from_descriptor() creates a BlobRef that wraps a UriReader and BlobDescriptor. The BlobRef does not perform any I/O at construction time -- it only stores the references needed for later data retrieval.

The key components are:

  • Blob -- abstract base class defining the to_data() and new_input_stream() interface
  • BlobRef -- concrete implementation that lazily loads data using a UriReader and BlobDescriptor
  • FileUriReader -- wraps PyArrow FileIO for filesystem and cloud storage access (oss://, s3://, hdfs://, local paths)
  • HttpUriReader -- handles http:// and https:// URIs via the Python requests library
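The UriReader contract described above can be illustrated with a minimal sketch (not pypaimon's actual code): a reader only has to turn a URI into a seekable byte stream. The InMemoryUriReader below is a hypothetical stand-in for FileUriReader/HttpUriReader so the sketch runs without a storage backend:

```python
import io
from abc import ABC, abstractmethod

class UriReader(ABC):
    """Contract: turn a URI into a seekable, readable byte stream."""
    @abstractmethod
    def new_input_stream(self, uri: str):
        ...

class InMemoryUriReader(UriReader):
    """Hypothetical reader backed by a dict, so no real storage is needed."""
    def __init__(self, store: dict):
        self.store = store

    def new_input_stream(self, uri: str) -> io.BytesIO:
        # A real FileUriReader would open via file_io here; an
        # HttpUriReader would issue an HTTP GET via requests.
        return io.BytesIO(self.store[uri])

reader = InMemoryUriReader({"mem://a": b"payload"})
stream = reader.new_input_stream("mem://a")
```

Any object with this single method can serve as the storage abstraction for the lazy blob machinery below.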

When BlobRef.to_data() is called, it:

  1. Opens an input stream via the UriReader using the descriptor's URI
  2. Seeks to the descriptor's offset position
  3. Reads exactly the descriptor's length bytes
  4. Returns the raw bytes

When BlobRef.new_input_stream() is called, it returns a BytesIO stream positioned at the start of the blob data, enabling streaming access for large files.
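The steps above can be sketched with a minimal, illustrative BlobRef. The simplified BlobDescriptor and the _FakeReader are assumptions made for the sketch, not pypaimon's actual classes:

```python
import io
from dataclasses import dataclass

@dataclass
class BlobDescriptor:
    """Simplified stand-in for pypaimon's descriptor: a URI plus byte range."""
    uri: str
    offset: int
    length: int

class _FakeReader:
    """Hypothetical in-memory reader standing in for a UriReader."""
    def __init__(self, store: dict):
        self.store = store

    def new_input_stream(self, uri: str) -> io.BytesIO:
        return io.BytesIO(self.store[uri])

class BlobRef:
    """Lazy blob reference: construction only stores references, no I/O."""
    def __init__(self, uri_reader, descriptor):
        self.uri_reader = uri_reader
        self.descriptor = descriptor

    def to_data(self) -> bytes:
        # 1. open an input stream for the descriptor's URI
        stream = self.uri_reader.new_input_stream(self.descriptor.uri)
        # 2. seek to the descriptor's offset
        stream.seek(self.descriptor.offset)
        # 3-4. read exactly `length` bytes and return them
        return stream.read(self.descriptor.length)

    def new_input_stream(self) -> io.BytesIO:
        # Stream positioned at the start of the blob data
        return io.BytesIO(self.to_data())

reader = _FakeReader({"mem://blobs/1": b"..HELLO.."})
blob = BlobRef(reader, BlobDescriptor("mem://blobs/1", offset=2, length=5))
```

Here blob.to_data() returns the five bytes starting at offset 2 of the stored payload; no read happens until that call.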

Usage

Use this after deserializing BlobDescriptor objects from a table read, when the actual blob content is needed. The FileUriReader is constructed from the table's file_io object, which provides access to the configured storage backend.

Code Reference

Source Location

  • Repository: Apache Paimon
  • Files:
    • paimon-python/pypaimon/table/row/blob.py:L130-251
    • paimon-python/pypaimon/common/uri_reader.py:L32-66

Signature

class Blob(ABC):
    @staticmethod
    def from_descriptor(uri_reader: UriReader, descriptor: BlobDescriptor) -> 'Blob': ...

    @abstractmethod
    def to_data(self) -> bytes: ...

    @abstractmethod
    def new_input_stream(self) -> io.BytesIO: ...

class FileUriReader(UriReader):
    def __init__(self, file_io: Any): ...
    def new_input_stream(self, uri: str): ...

class BlobRef(Blob):
    def to_data(self) -> bytes: ...
    def new_input_stream(self) -> io.BytesIO: ...

Import

from pypaimon.table.row.blob import Blob, BlobDescriptor
from pypaimon.common.uri_reader import FileUriReader

I/O Contract

Inputs

  • uri_reader (UriReader, required) -- a FileUriReader (for filesystem/cloud storage) or HttpUriReader (for HTTP/HTTPS URLs) that provides access to the storage backend
  • descriptor (BlobDescriptor, required) -- a deserialized BlobDescriptor containing the URI, offset, and length of the external blob

Outputs

  • from_descriptor() returns Blob (a BlobRef) -- a lazy blob reference that defers I/O until to_data() or new_input_stream() is called
  • to_data() returns bytes -- the actual blob content, read from the external storage location specified by the descriptor
  • new_input_stream() returns io.BytesIO -- a BytesIO stream providing streaming access to the blob content

Usage Examples

Basic Usage

from pypaimon.table.row.blob import Blob, BlobDescriptor
from pypaimon.common.uri_reader import FileUriReader

# Reconstruct from stored descriptor bytes
descriptor = BlobDescriptor.deserialize(stored_bytes)

# Create lazy blob reference using the table's file_io
uri_reader = FileUriReader(table.file_io)
blob = Blob.from_descriptor(uri_reader, descriptor)

# Load data on demand -- no I/O occurs until this call
data = blob.to_data()
print(f"Loaded {len(data)} bytes from {descriptor.uri}")

Streaming Access for Large Files

from pypaimon.table.row.blob import Blob, BlobDescriptor
from pypaimon.common.uri_reader import FileUriReader

descriptor = BlobDescriptor.deserialize(stored_bytes)
uri_reader = FileUriReader(table.file_io)
blob = Blob.from_descriptor(uri_reader, descriptor)

# Stream large blob content in chunks
with blob.new_input_stream() as stream:
    while True:
        chunk = stream.read(4096)
        if not chunk:
            break
        process_chunk(chunk)

Selective Loading from Table Read

from pypaimon.table.row.blob import Blob, BlobDescriptor
from pypaimon.common.uri_reader import FileUriReader

# Read the table metadata
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
splits = scan.plan().splits()
reader = read_builder.new_read()
arrow_table = reader.to_arrow(splits)

uri_reader = FileUriReader(table.file_io)

# Selectively load only video blobs larger than 1MB
for i in range(len(arrow_table)):
    content_type = arrow_table.column('content_type')[i].as_py()
    if content_type.startswith('video/'):
        descriptor = BlobDescriptor.deserialize(
            arrow_table.column('data')[i].as_py()
        )
        if descriptor.length > 1048576:
            blob = Blob.from_descriptor(uri_reader, descriptor)
            video_data = blob.to_data()
            process_video(video_data)

Related Pages

Implements Principle

Requires Environment
