Implementation:Apache Paimon FieldBunch

From Leeroopedia


Knowledge Sources
Domains: Data Organization, Schema Evolution
Last Updated: 2026-02-08 00:00 GMT

Overview

FieldBunch organizes data files into groups by field content to support data evolution and partial field updates.

Description

FieldBunch is an abstract interface with concrete implementations DataBunch and BlobBunch that organize DataFileMeta objects by field content. This abstraction supports data evolution scenarios where different files may contain different subsets of columns, particularly in tables with blob files that store partial field updates.

DataBunch wraps a single data file that contains complete data records, while BlobBunch collects blob files that contain partial field data. BlobBunch enforces several constraints across all files in the bunch: row IDs must be continuous, schema IDs must match, and write columns must be consistent. It also validates that blob files are added in the correct order and that the accumulated row count does not exceed the expected row count.
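The checks described above can be sketched as follows. This is a simplified illustration, not the pypaimon implementation: `BlobFileSketch` is a hypothetical stand-in for `DataFileMeta`, its field names and the `.blob` suffix test are assumptions, and raising `ValueError` is a choice made for this sketch. The sketch also ignores the sequence-number and latest-first-row-ID bookkeeping that the real class keeps.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlobFileSketch:
    """Hypothetical stand-in for DataFileMeta; field names are assumptions."""
    file_name: str
    first_row_id: int
    row_count: int
    schema_id: int
    write_cols: Optional[List[str]] = None


class BlobBunchSketch:
    """Simplified illustration of the BlobBunch constraints."""

    def __init__(self, expected_row_count: int):
        self._files: List[BlobFileSketch] = []
        self.expected_row_count = expected_row_count
        self.expected_next_first_row_id = -1
        self._row_count = 0

    def add(self, file: BlobFileSketch) -> None:
        # Assumption: blob files are recognized by their file suffix.
        if not file.file_name.endswith(".blob"):
            raise ValueError("Only blob files can be added to a blob bunch")
        if self._files:
            first = self._files[0]
            # Row IDs must continue exactly where the previous file ended.
            if file.first_row_id != self.expected_next_first_row_id:
                raise ValueError("Blob file first_row_id is not continuous")
            if file.schema_id != first.schema_id:
                raise ValueError("Blob files must share one schema_id")
            if file.write_cols != first.write_cols:
                raise ValueError("Blob files must share the same write_cols")
        if self._row_count + file.row_count > self.expected_row_count:
            raise ValueError("Blob files exceed expected_row_count")
        self._files.append(file)
        self._row_count += file.row_count
        self.expected_next_first_row_id = file.first_row_id + file.row_count

    def row_count(self) -> int:
        return self._row_count

    def files(self) -> List[BlobFileSketch]:
        return self._files
```

In this sketch a rejected file leaves the bunch unchanged, because every check runs before the file is appended.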

This organization enables efficient handling of schema evolution and partial updates, allowing the system to read and merge data from files with different field compositions while maintaining consistency and data integrity.
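To make the idea of merging files with different field compositions concrete, here is a conceptual sketch. It is not how pypaimon reads files: `merge_partial_column` is a hypothetical helper, rows are modeled as plain dictionaries, and the position of a row in `base_rows` is assumed to be its row ID.

```python
from typing import Any, Dict, List


def merge_partial_column(
    base_rows: List[Dict[str, Any]],
    column: str,
    values: List[Any],
    first_row_id: int = 0,
) -> List[Dict[str, Any]]:
    """Return copies of base_rows with `column` filled in from `values`.

    values[i] is assumed to belong to the row whose ID is first_row_id + i.
    """
    merged = [dict(row) for row in base_rows]  # leave the input untouched
    for offset, value in enumerate(values):
        merged[first_row_id + offset][column] = value
    return merged


# Rows from a file holding complete records, plus one column from a partial file.
rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
merged = merge_partial_column(rows, "payload", [b"x", b"y"])
```

Aligning by row ID is the reason continuous row IDs matter for blob files: a gap would leave some rows without a value for the partial column.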

Usage

Use FieldBunch implementations when working with data evolution scenarios where files may contain different subsets of fields. DataBunch is used for regular data files, while BlobBunch is used for blob files that store partial field updates with specific ordering and validation requirements.

Code Reference

Source Location

Signature

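from abc import ABC
from typing import List

# DataFileMeta is pypaimon's data file metadata class; its import is omitted here.


class FieldBunch(ABC):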
    """Interface for files organized by field."""

    def row_count(self) -> int:
        """Return the total row count for this bunch."""
        ...

    def files(self) -> List[DataFileMeta]:
        """Return the list of files in this bunch."""
        ...


class DataBunch(FieldBunch):
    """Files for a single data file."""

    def __init__(self, data_file: DataFileMeta):
        self.data_file = data_file

    def row_count(self) -> int:
        ...

    def files(self) -> List[DataFileMeta]:
        ...


class BlobBunch(FieldBunch):
    """Files for partial field (blob files)."""

    def __init__(self, expected_row_count: int):
        self._files: List[DataFileMeta] = []
        self.expected_row_count = expected_row_count
        self.latest_first_row_id = -1
        self.expected_next_first_row_id = -1
        self.latest_max_sequence_number = -1
        self._row_count = 0

    def add(self, file: DataFileMeta) -> None:
        """Add a blob file to this bunch."""
        ...

    def row_count(self) -> int:
        ...

    def files(self) -> List[DataFileMeta]:
        ...
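The method bodies are elided in the signature above. One plausible way to fill them in for the single-file case is sketched below. This is an assumption rather than the library source: `FileMetaSketch` is a hypothetical stand-in for `DataFileMeta`, and the `row_count` attribute on it is assumed.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class FileMetaSketch:
    """Hypothetical stand-in for DataFileMeta; attribute names are assumptions."""
    file_name: str
    row_count: int


class FieldBunchSketch(ABC):
    """Interface: every bunch reports its row count and its files."""

    @abstractmethod
    def row_count(self) -> int:
        ...

    @abstractmethod
    def files(self) -> List[FileMetaSketch]:
        ...


class DataBunchSketch(FieldBunchSketch):
    """A bunch holding exactly one data file."""

    def __init__(self, data_file: FileMetaSketch):
        self.data_file = data_file

    def row_count(self) -> int:
        # With a single file, the bunch row count is that file's row count.
        return self.data_file.row_count

    def files(self) -> List[FileMetaSketch]:
        return [self.data_file]
```

Because both kinds of bunch expose the same two methods, calling code can treat them uniformly, for example when checking that every bunch in a group covers the same number of rows.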

Import

from pypaimon.read.reader.field_bunch import FieldBunch, DataBunch, BlobBunch

I/O Contract

Inputs

Name | Type | Required | Description
data_file | DataFileMeta | Yes (DataBunch) | Single data file for complete records
expected_row_count | int | Yes (BlobBunch) | Expected total row count for blob files
file | DataFileMeta | Yes (add method) | Blob file to add to the bunch

Outputs

Name | Type | Description
row_count | int | Total row count across all files in the bunch
files | List[DataFileMeta] | List of data file metadata objects in the bunch

Usage Examples

from pypaimon.read.reader.field_bunch import DataBunch, BlobBunch

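# DataBunch for a regular data file.
# get_data_file_meta() is a placeholder (not defined on this page) for
# however you obtain a DataFileMeta.
data_file = get_data_file_meta()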
data_bunch = DataBunch(data_file)
print(f"Row count: {data_bunch.row_count()}")
print(f"Files: {data_bunch.files()}")

# BlobBunch for partial field files
blob_bunch = BlobBunch(expected_row_count=1000)

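# Add blob files in order of first_row_id, with no gaps between files.
# get_blob_file_meta() is a placeholder (not defined on this page) that
# returns a DataFileMeta for a blob file.
blob_file1 = get_blob_file_meta(first_row_id=0, row_count=500)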
blob_bunch.add(blob_file1)

blob_file2 = get_blob_file_meta(first_row_id=500, row_count=500)
blob_bunch.add(blob_file2)

print(f"Total rows: {blob_bunch.row_count()}")
print(f"Files in bunch: {len(blob_bunch.files())}")

# Validation errors will be raised for:
# - Non-blob files
# - Files with gaps in first_row_id
# - Files exceeding expected_row_count
# - Files with different schema_id or write_cols
