Implementation: Apache Paimon BlobFormatWriter Write
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Blob_Storage |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for writing serialized blob descriptors and metadata to Paimon tables.
Description
The standard TableWrite.write_arrow() method handles blob columns by delegating to BlobFormatWriter and DataBlobWriter internally. The blob column values must be the serialized bytes produced by BlobDescriptor.serialize().
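To make the "serialized bytes" idea concrete, here is a minimal sketch of packing a (uri, offset, length) triple into bytes and unpacking it again. The layout below (length-prefixed UTF-8 URI followed by two big-endian 64-bit integers) is an illustrative assumption, not BlobDescriptor's actual wire format:

```python
import struct

def serialize_descriptor(uri: str, offset: int, length: int) -> bytes:
    """Pack a descriptor as: uri length (u32), uri bytes, offset (u64), length (u64).
    Illustrative layout only -- not Paimon's actual serialization format."""
    uri_bytes = uri.encode("utf-8")
    return struct.pack(f">I{len(uri_bytes)}sQQ", len(uri_bytes), uri_bytes, offset, length)

def deserialize_descriptor(data: bytes):
    """Reverse of serialize_descriptor: recover (uri, offset, length)."""
    (uri_len,) = struct.unpack_from(">I", data, 0)
    uri = data[4:4 + uri_len].decode("utf-8")
    offset, length = struct.unpack_from(">QQ", data, 4 + uri_len)
    return uri, offset, length

blob = serialize_descriptor("oss://bucket/file1.mov", 0, 1048576)
assert deserialize_descriptor(blob) == ("oss://bucket/file1.mov", 0, 1048576)
```

The real BlobDescriptor.serialize() handles this packing for you; the point is only that the blob column carries opaque descriptor bytes, not raw file content.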
The writing pipeline works as follows:
- TableWrite.write_arrow() receives a PyArrow table containing both regular columns and serialized blob descriptor bytes.
- DataBlobWriter separates the blob column from regular data columns and routes them to the appropriate writers.
- BlobFormatWriter writes the blob column data with a magic number header and CRC32 checksums for each entry, ensuring data integrity.
- After all rows are written, prepare_commit() produces a list of CommitMessage objects.
- commit() atomically finalizes the write, making all data visible to readers.
The BlobFormatWriter writes each blob entry with integrity metadata, enabling the corresponding FormatBlobReader to validate data on read.
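The per-entry integrity scheme described above can be sketched in miniature. The MAGIC constant, field order, and field sizes here are illustrative assumptions rather than Paimon's actual on-disk layout; the point is the write-with-checksum / validate-on-read pattern:

```python
import io
import struct
import zlib

MAGIC = b"BLOB"  # hypothetical magic number; Paimon's actual constant differs

def write_entry(buf: io.BytesIO, payload: bytes) -> None:
    # Each entry: magic, payload length, payload, CRC32 of the payload.
    buf.write(MAGIC)
    buf.write(struct.pack(">I", len(payload)))
    buf.write(payload)
    buf.write(struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF))

def read_entry(buf: io.BytesIO) -> bytes:
    # Validate the magic number, then the checksum, before trusting the payload.
    if buf.read(4) != MAGIC:
        raise IOError("bad magic number: not a blob entry")
    (n,) = struct.unpack(">I", buf.read(4))
    payload = buf.read(n)
    (crc,) = struct.unpack(">I", buf.read(4))
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise IOError("CRC32 mismatch: blob entry corrupted")
    return payload

buf = io.BytesIO()
write_entry(buf, b"hello blob")
buf.seek(0)
assert read_entry(buf) == b"hello blob"
```

On the read path, FormatBlobReader performs the analogous validation, so corruption in storage surfaces as an error rather than as silently bad data.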
Usage
Use this after constructing serialized BlobDescriptor bytes and assembling them into a PyArrow table. The write + commit pattern is the standard Paimon write path, extended to handle blob columns transparently.
Code Reference
Source Location
- Repository: Apache Paimon
- Files:
- paimon-python/pypaimon/write/table_write.py:L42-45
- paimon-python/pypaimon/write/blob_format_writer.py:L27-108
- paimon-python/pypaimon/write/writer/data_blob_writer.py:L36-317
Signature
class TableWrite:
    def write_arrow(self, table: pa.Table) -> None: ...

class BatchTableWrite(TableWrite):
    def prepare_commit(self) -> List[CommitMessage]: ...
Import
from pypaimon.write.table_write import BatchTableWrite
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| table | pa.Table | Yes | PyArrow table with serialized descriptor bytes in the blob column and regular values in other columns |
Outputs
| Name | Type | Description |
|---|---|---|
| None | None | write_arrow() returns None; data is written to the file store internally |
| commit_messages | List[CommitMessage] | prepare_commit() returns a list of commit messages for use with commit() |
Usage Examples
Basic Usage
import pyarrow as pa
from pypaimon.table.row.blob import BlobDescriptor
# Create descriptors for external files
descriptors = [
BlobDescriptor('oss://bucket/file1.mov', 0, 1048576),
BlobDescriptor('oss://bucket/file2.mov', 0, 2097152),
]
# Build PyArrow table with serialized descriptors in the blob column
data = pa.table({
'id': [1, 2],
'filename': ['file1.mov', 'file2.mov'],
'content_type': ['video/mp4', 'video/mp4'],
'data': [d.serialize() for d in descriptors],
})
# Write and commit using the standard Paimon write path
# (`table` is an existing Paimon table obtained from the catalog)
write_builder = table.new_batch_write_builder()
writer = write_builder.new_write()
commit = write_builder.new_commit()
writer.write_arrow(data)
commit_messages = writer.prepare_commit()
commit.commit(commit_messages)
Multi-Batch Writing
import pyarrow as pa
from pypaimon.table.row.blob import BlobDescriptor
write_builder = table.new_batch_write_builder()
writer = write_builder.new_write()
commit = write_builder.new_commit()
# Write multiple batches before committing
for batch_files in file_batches:  # file_batches: iterable of lists of file-metadata dicts
    descriptors = [
        BlobDescriptor(uri=f['uri'], offset=0, length=f['size'])
        for f in batch_files
    ]
    batch = pa.table({
        'id': [f['id'] for f in batch_files],
        'filename': [f['name'] for f in batch_files],
        'content_type': [f['type'] for f in batch_files],
        'data': [d.serialize() for d in descriptors],
    })
    writer.write_arrow(batch)
# Commit all batches atomically
commit_messages = writer.prepare_commit()
commit.commit(commit_messages)