
Principle:Apache Paimon Lazy Blob Loading

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Blob_Storage
Last Updated 2026-02-07 00:00 GMT

Overview

A mechanism for loading binary object data on demand from external storage, using blob descriptors as lightweight references.

Description

Lazy blob loading defers the retrieval of actual binary data until it is explicitly requested. Given a BlobDescriptor with URI, offset, and length, Blob.from_descriptor() creates a BlobRef that holds the reference without loading any data. The actual I/O occurs only when one of the following methods is called:

  • to_data() -- reads the entire referenced byte range and returns it as a bytes object
  • new_input_stream() -- returns a BytesIO stream for streaming access to the blob content
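The deferral described above can be sketched as a small proxy object. This is an illustrative stand-in, not the pypaimon implementation: the `BlobDescriptor` fields and the `reader` callable are assumptions, and an in-memory dict plays the role of the object store so the example is self-contained.

```python
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobDescriptor:
    """Lightweight reference to a byte range inside an external object."""
    uri: str
    offset: int
    length: int


class BlobRef:
    """Holds only the descriptor; performs no I/O until data is requested."""

    def __init__(self, descriptor, reader):
        self._descriptor = descriptor
        self._reader = reader  # invoked lazily, never at construction time

    def to_data(self) -> bytes:
        # The actual read happens here, on first explicit request.
        d = self._descriptor
        return self._reader(d.uri, d.offset, d.length)

    def new_input_stream(self) -> io.BytesIO:
        # Wraps the loaded bytes in a stream for incremental consumption.
        return io.BytesIO(self.to_data())


# Fake backend: an in-memory "object store" keyed by URI.
store = {"mem://objects/a": b"xxHELLO-BLOBxx"}
reads = []


def read_range(uri: str, offset: int, length: int) -> bytes:
    reads.append(uri)  # record that I/O actually happened
    return store[uri][offset:offset + length]


ref = BlobRef(BlobDescriptor("mem://objects/a", 2, 10), read_range)
assert reads == []                     # constructing the ref triggered no I/O
assert ref.to_data() == b"HELLO-BLOB"  # first access reads only this range
```

The key property is visible in the assertions: building the reference costs nothing, and the byte range is fetched only when `to_data()` is called.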

The UriReader abstraction supports multiple storage backends:

  • FileUriReader -- handles filesystem and cloud storage URIs (via PyArrow FileIO), supporting schemes like oss://, s3://, hdfs://, and local paths
  • HttpUriReader -- handles http:// and https:// URLs using the Python requests library
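The backend dispatch can be sketched as scheme-based strategy selection. The interface and registry below are hypothetical (the real `FileUriReader`/`HttpUriReader` signatures may differ); an in-memory reader stands in for a storage backend.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class UriReader(ABC):
    """Strategy interface: each backend knows how to read a byte range."""

    @abstractmethod
    def read(self, uri: str, offset: int, length: int) -> bytes: ...


class InMemoryUriReader(UriReader):
    """Stand-in for a real backend, backed by a dict of URI -> bytes."""

    def __init__(self, objects):
        self._objects = objects

    def read(self, uri, offset, length):
        return self._objects[uri][offset:offset + length]


def reader_for(uri: str, readers: dict) -> UriReader:
    """Pick a backend by URI scheme; a new scheme needs only a new entry."""
    scheme = urlparse(uri).scheme
    try:
        return readers[scheme]
    except KeyError:
        raise ValueError(f"no UriReader registered for scheme {scheme!r}")


readers = {"mem": InMemoryUriReader({"mem://bucket/blob": b"0123456789"})}
r = reader_for("mem://bucket/blob", readers)
assert r.read("mem://bucket/blob", 3, 4) == b"3456"
```

Registering a new scheme (say, `http`) means adding one entry to the mapping; nothing in the dispatch or the blob-reference logic changes.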

This lazy approach prevents loading potentially large binary files during metadata queries. When reading a blob-enabled table with thousands of rows, only the lightweight descriptors are loaded. The actual multi-megabyte or multi-gigabyte blobs are fetched only for the specific rows that need them.

Usage

Use when blob content is needed after reading blob descriptors from a Paimon table. Lazy loading makes it possible to fetch only the blobs that are actually needed, which is especially important when:

  • Browsing or filtering table metadata before deciding which blobs to load
  • Processing a subset of blobs from a large table
  • Streaming large blob content that does not fit in memory
  • Implementing pagination or lazy scrolling in user interfaces

This is the fifth and final step in the blob storage pipeline, following schema definition, descriptor construction, metadata writing, and descriptor deserialization.

Theoretical Basis

Lazy evaluation defers computation until the result is needed. For large binary objects, this prevents unnecessary memory allocation and I/O. A table scan that reads metadata for 10,000 rows loads only the descriptor bytes (a few dozen bytes each), not the actual blob data (which could be gigabytes per row).
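A back-of-envelope calculation makes the gap concrete. The sizes are assumptions chosen for illustration (roughly 48 bytes per serialized descriptor, 1 GiB per blob), not figures from the Paimon format:

```python
rows = 10_000
descriptor_bytes = 48       # assumed size of one serialized descriptor
blob_bytes = 1 * 1024**3    # assumed 1 GiB of blob data per row

eager = rows * blob_bytes        # loading every blob during the scan
lazy = rows * descriptor_bytes   # loading descriptors only

print(f"eager scan: {eager / 1024**4:.1f} TiB")
print(f"lazy scan:  {lazy / 1024:.1f} KiB")
```

Under these assumptions the eager scan touches nearly 10 TiB while the lazy scan reads under half a megabyte, a ratio of about seven orders of magnitude.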

The BlobRef pattern implements a proxy that delays loading until data access is requested. The proxy exposes the same interface as an in-memory blob (to_data(), new_input_stream()) but internally holds only a reference (the descriptor) until one of these methods is invoked.

The UriReader abstraction follows the strategy pattern, allowing the blob loading mechanism to support multiple storage backends without modifying the core BlobRef logic. New storage backends can be added by implementing the UriReader interface, without changing any existing code.

The new_input_stream() method enables streaming access, which is critical for blobs that are too large to fit in memory. By returning a BytesIO stream, the caller can read the blob in chunks, processing each chunk before loading the next. This follows the iterator pattern for memory-efficient processing of large data.
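Chunked consumption of such a stream can be sketched as follows. The checksum function and chunk size are illustrative; the point is that the processing loop holds only one chunk at a time and works against any readable stream, such as the one `new_input_stream()` returns.

```python
import io


def checksum_stream(stream, chunk_size: int = 8192) -> int:
    """Consume a blob stream chunk by chunk, keeping one chunk resident."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes object signals end of stream
            break
        total = (total + sum(chunk)) % (2**32)
    return total


# Stand-in for a stream returned by new_input_stream().
blob_stream = io.BytesIO(bytes(range(256)) * 100)
result = checksum_stream(blob_stream, chunk_size=4096)
```

Each iteration processes and discards a chunk before the next is read, so peak memory use is bounded by `chunk_size` regardless of total blob size.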

Related Pages

Implemented By
