Implementation: Apache Paimon BlobFormatWriter Write
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Blob_Storage |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for writing serialized blob descriptors and metadata to Paimon tables.
Description
The standard TableWrite.write_arrow() method handles blob columns by delegating to BlobFormatWriter and DataBlobWriter internally. The blob column values must be the serialized bytes produced by BlobDescriptor.serialize().
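To make the "serialized bytes" idea concrete, here is a minimal sketch of packing a (uri, offset, length) triple into bytes and unpacking it again. The layout below (length-prefixed UTF-8 URI followed by two big-endian 64-bit integers) is an illustrative assumption, not BlobDescriptor's actual wire format:

```python
import struct

def serialize_descriptor(uri: str, offset: int, length: int) -> bytes:
    """Pack a descriptor as: uri length (u32), uri bytes, offset (u64), length (u64).
    Illustrative layout only -- not Paimon's actual serialization format."""
    uri_bytes = uri.encode("utf-8")
    return struct.pack(f">I{len(uri_bytes)}sQQ", len(uri_bytes), uri_bytes, offset, length)

def deserialize_descriptor(data: bytes):
    """Reverse of serialize_descriptor: recover (uri, offset, length)."""
    (uri_len,) = struct.unpack_from(">I", data, 0)
    uri = data[4:4 + uri_len].decode("utf-8")
    offset, length = struct.unpack_from(">QQ", data, 4 + uri_len)
    return uri, offset, length

blob = serialize_descriptor("oss://bucket/file1.mov", 0, 1048576)
assert deserialize_descriptor(blob) == ("oss://bucket/file1.mov", 0, 1048576)
```

The real BlobDescriptor.serialize() handles this packing for you; the point is only that the blob column carries opaque descriptor bytes, not raw file content.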
The writing pipeline works as follows:
- TableWrite.write_arrow() receives a PyArrow table containing both regular columns and serialized blob descriptor bytes.
- DataBlobWriter separates the blob column from regular data columns and routes them to the appropriate writers.
- BlobFormatWriter writes the blob column data with a magic number header and CRC32 checksums for each entry, ensuring data integrity.
- After all rows are written, prepare_commit() produces a list of CommitMessage objects.
- commit() atomically finalizes the write, making all data visible to readers.
The BlobFormatWriter writes each blob entry with integrity metadata, enabling the corresponding FormatBlobReader to validate data on read.
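The per-entry integrity scheme described above can be sketched in miniature. The MAGIC constant, field order, and field sizes here are illustrative assumptions rather than Paimon's actual on-disk layout; the point is the write-with-checksum / validate-on-read pattern:

```python
import io
import struct
import zlib

MAGIC = b"BLOB"  # hypothetical magic number; Paimon's actual constant differs

def write_entry(buf: io.BytesIO, payload: bytes) -> None:
    # Each entry: magic, payload length, payload, CRC32 of the payload.
    buf.write(MAGIC)
    buf.write(struct.pack(">I", len(payload)))
    buf.write(payload)
    buf.write(struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF))

def read_entry(buf: io.BytesIO) -> bytes:
    # Validate the magic number, then the checksum, before trusting the payload.
    if buf.read(4) != MAGIC:
        raise IOError("bad magic number: not a blob entry")
    (n,) = struct.unpack(">I", buf.read(4))
    payload = buf.read(n)
    (crc,) = struct.unpack(">I", buf.read(4))
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise IOError("CRC32 mismatch: blob entry corrupted")
    return payload

buf = io.BytesIO()
write_entry(buf, b"hello blob")
buf.seek(0)
assert read_entry(buf) == b"hello blob"
```

On the read path, FormatBlobReader performs the analogous validation, so corruption in storage surfaces as an error rather than as silently bad data.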
Usage
Use this after constructing serialized BlobDescriptor bytes and assembling them into a PyArrow table. The write + commit pattern is the standard Paimon write path, extended to handle blob columns transparently.
Code Reference
Source Location
- Repository: Apache Paimon
- Files:
- paimon-python/pypaimon/write/table_write.py:L42-45
- paimon-python/pypaimon/write/blob_format_writer.py:L27-108
- paimon-python/pypaimon/write/writer/data_blob_writer.py:L36-317
Signature
class TableWrite:
    def write_arrow(self, table: pa.Table) -> None: ...

class BatchTableWrite(TableWrite):
    def prepare_commit(self) -> List[CommitMessage]: ...
Import
from pypaimon.write.table_write import BatchTableWrite
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| table | pa.Table | Yes | PyArrow table with serialized descriptor bytes in the blob column and regular values in other columns |
Outputs
| Name | Type | Description |
|---|---|---|
| None | None | write_arrow() returns None; data is written to the file store internally |
| commit_messages | List[CommitMessage] | prepare_commit() returns a list of commit messages for use with commit() |
Usage Examples
Basic Usage
import pyarrow as pa
from pypaimon.table.row.blob import BlobDescriptor
# Create descriptors for external files
descriptors = [
BlobDescriptor('oss://bucket/file1.mov', 0, 1048576),
BlobDescriptor('oss://bucket/file2.mov', 0, 2097152),
]
# Build PyArrow table with serialized descriptors in the blob column
data = pa.table({
'id': [1, 2],
'filename': ['file1.mov', 'file2.mov'],
'content_type': ['video/mp4', 'video/mp4'],
'data': [d.serialize() for d in descriptors],
})
# Write and commit using the standard Paimon write path
# (`table` is an existing Paimon table obtained from the catalog)
write_builder = table.new_batch_write_builder()
writer = write_builder.new_write()
commit = write_builder.new_commit()
writer.write_arrow(data)
commit_messages = writer.prepare_commit()
commit.commit(commit_messages)
Multi-Batch Writing
import pyarrow as pa
from pypaimon.table.row.blob import BlobDescriptor
write_builder = table.new_batch_write_builder()
writer = write_builder.new_write()
commit = write_builder.new_commit()
# Write multiple batches before committing
for batch_files in file_batches:  # file_batches: iterable of lists of file-metadata dicts
    descriptors = [
        BlobDescriptor(uri=f['uri'], offset=0, length=f['size'])
        for f in batch_files
    ]
    batch = pa.table({
        'id': [f['id'] for f in batch_files],
        'filename': [f['name'] for f in batch_files],
        'content_type': [f['type'] for f in batch_files],
        'data': [d.serialize() for d in descriptors],
    })
    writer.write_arrow(batch)
# Commit all batches atomically
commit_messages = writer.prepare_commit()
commit.commit(commit_messages)