
Implementation:Apache Paimon BlobFormatWriter Write

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Blob_Storage
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool for writing serialized blob descriptors and metadata to Paimon tables.

Description

The standard TableWrite.write_arrow() method handles blob columns by delegating to BlobFormatWriter and DataBlobWriter internally. The blob column values must be the serialized bytes produced by BlobDescriptor.serialize().

The writing pipeline works as follows:

  1. TableWrite.write_arrow() receives a PyArrow table containing both regular columns and serialized blob descriptor bytes.
  2. DataBlobWriter separates the blob column from regular data columns and routes them to the appropriate writers.
  3. BlobFormatWriter writes the blob column data with a magic number header and CRC32 checksums for each entry, ensuring data integrity.
  4. After all rows are written, prepare_commit() produces a list of CommitMessage objects.
  5. commit() atomically finalizes the write, making all data visible to readers.

The BlobFormatWriter writes each blob entry with integrity metadata, enabling the corresponding FormatBlobReader to validate data on read.
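To make the integrity mechanism concrete, here is a minimal sketch of magic-number framing with a per-entry CRC32 checksum. The magic constant, frame layout, and function names below are invented for illustration; they do not reproduce Paimon's actual blob file format, only the validate-on-read idea it relies on.

```python
import struct
import zlib

MAGIC = 0x424C4F42  # hypothetical magic number, not Paimon's actual value

def write_entry(payload: bytes) -> bytes:
    """Frame a payload as: magic (4B) | length (4B) | payload | CRC32 (4B)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack(">II", MAGIC, len(payload)) + payload + struct.pack(">I", crc)

def read_entry(frame: bytes) -> bytes:
    """Validate the magic number and checksum, returning the payload."""
    magic, length = struct.unpack_from(">II", frame, 0)
    if magic != MAGIC:
        raise ValueError("bad magic number: not a blob entry")
    payload = frame[8:8 + length]
    (crc,) = struct.unpack_from(">I", frame, 8 + length)
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("CRC32 mismatch: corrupted entry")
    return payload

frame = write_entry(b"descriptor-bytes")
assert read_entry(frame) == b"descriptor-bytes"
```

Because the checksum is stored per entry, a reader can detect corruption of any single blob record without scanning the whole file.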

Usage

Use this after constructing serialized BlobDescriptor bytes and assembling them into a PyArrow table. The write + commit pattern is the standard Paimon write path, extended to handle blob columns transparently.

Code Reference

Source Location

  • Repository: Apache Paimon
  • Files:
    • paimon-python/pypaimon/write/table_write.py:L42-45
    • paimon-python/pypaimon/write/blob_format_writer.py:L27-108
    • paimon-python/pypaimon/write/writer/data_blob_writer.py:L36-317

Signature

class TableWrite:
    def write_arrow(self, table: pa.Table) -> None:

class BatchTableWrite(TableWrite):
    def prepare_commit(self) -> List[CommitMessage]:

Import

from pypaimon.write.table_write import BatchTableWrite

I/O Contract

Inputs

Name Type Required Description
table pa.Table Yes PyArrow table with serialized descriptor bytes in the blob column and regular values in other columns

Outputs

Name Type Description
None None write_arrow() returns None; data is written to the file store internally
commit_messages List[CommitMessage] prepare_commit() returns a list of commit messages for use with commit()

Usage Examples

Basic Usage

import pyarrow as pa
from pypaimon.table.row.blob import BlobDescriptor

# Create descriptors for external files
descriptors = [
    BlobDescriptor('oss://bucket/file1.mov', 0, 1048576),
    BlobDescriptor('oss://bucket/file2.mov', 0, 2097152),
]

# Build PyArrow table with serialized descriptors in the blob column
data = pa.table({
    'id': [1, 2],
    'filename': ['file1.mov', 'file2.mov'],
    'content_type': ['video/mp4', 'video/mp4'],
    'data': [d.serialize() for d in descriptors],
})

# Write and commit using the standard Paimon write path
# (`table` is an existing Paimon table handle, e.g. obtained from a catalog)
write_builder = table.new_batch_write_builder()
writer = write_builder.new_write()
commit = write_builder.new_commit()

writer.write_arrow(data)
commit_messages = writer.prepare_commit()
commit.commit(commit_messages)

Multi-Batch Writing

import pyarrow as pa
from pypaimon.table.row.blob import BlobDescriptor

# `table` is an existing Paimon table handle, e.g. obtained from a catalog
write_builder = table.new_batch_write_builder()
writer = write_builder.new_write()
commit = write_builder.new_commit()

# Write multiple batches before committing; `file_batches` is an iterable
# of lists of file-metadata dicts with 'id', 'name', 'type', 'uri', 'size' keys
for batch_files in file_batches:
    descriptors = [
        BlobDescriptor(uri=f['uri'], offset=0, length=f['size'])
        for f in batch_files
    ]
    batch = pa.table({
        'id': [f['id'] for f in batch_files],
        'filename': [f['name'] for f in batch_files],
        'content_type': [f['type'] for f in batch_files],
        'data': [d.serialize() for d in descriptors],
    })
    writer.write_arrow(batch)

# Commit all batches atomically
commit_messages = writer.prepare_commit()
commit.commit(commit_messages)
