Implementation:Apache Paimon FieldBunch

Knowledge Sources	Apache_Paimon
Domains	Data Organization, Schema Evolution
Last Updated	2026-02-08 00:00 GMT

Overview

FieldBunch groups data files by the fields they contain, which supports data evolution and partial field updates.

Description

FieldBunch is an abstract interface with concrete implementations DataBunch and BlobBunch that organize DataFileMeta objects by field content. This abstraction supports data evolution scenarios where different files may contain different subsets of columns, particularly in tables with blob files that store partial field updates.

DataBunch wraps a single data file containing complete records, while BlobBunch collects blob files containing partial field data. BlobBunch enforces constraints across all files in the bunch: row IDs must be continuous, schema IDs must match, and write columns must be consistent. It also validates that blob files are added in the correct order and that the accumulated row count does not exceed expected_row_count.
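The constraints described above can be sketched with simplified stand-in classes. This is only an illustration under stated assumptions, not the pypaimon implementation: FileMetaSketch and BlobBunchSketch are hypothetical names, the ".blob" suffix check and the use of ValueError are assumptions, and the sequence-number handling implied by latest_max_sequence_number is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileMetaSketch:
    """Hypothetical stand-in for DataFileMeta; only the fields used below."""
    file_name: str
    first_row_id: int
    row_count: int
    schema_id: int
    write_cols: Optional[List[str]] = None


class BlobBunchSketch:
    """Illustrative only: mirrors the documented constraints, not pypaimon's code."""

    def __init__(self, expected_row_count: int):
        self._files: List[FileMetaSketch] = []
        self.expected_row_count = expected_row_count
        self.expected_next_first_row_id = -1
        self._row_count = 0

    def add(self, file: FileMetaSketch) -> None:
        # Assumption: blob files are recognised by a ".blob" suffix.
        if not file.file_name.endswith(".blob"):
            raise ValueError("only blob files can be added to a blob bunch")
        if self._files:
            first = self._files[0]
            # Row IDs must continue exactly where the previous file ended.
            if file.first_row_id != self.expected_next_first_row_id:
                raise ValueError(
                    f"expected first_row_id {self.expected_next_first_row_id}, "
                    f"got {file.first_row_id}"
                )
            if file.schema_id != first.schema_id:
                raise ValueError("schema_id differs from the first file")
            if file.write_cols != first.write_cols:
                raise ValueError("write_cols differ from the first file")
        # The accumulated row count may not exceed the expected total.
        if self._row_count + file.row_count > self.expected_row_count:
            raise ValueError("row count exceeds expected_row_count")
        self._files.append(file)
        self._row_count += file.row_count
        self.expected_next_first_row_id = file.first_row_id + file.row_count

    def row_count(self) -> int:
        return self._row_count

    def files(self) -> List[FileMetaSketch]:
        return self._files


bunch = BlobBunchSketch(expected_row_count=1000)
bunch.add(FileMetaSketch("a.blob", 0, 500, 1, ["image"]))
bunch.add(FileMetaSketch("b.blob", 500, 500, 1, ["image"]))
print(bunch.row_count())  # prints 1000
```

Checking each new file against the first file in the bunch keeps the schema and write-column comparison cheap, since every accepted file is already known to agree with it.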

This organization enables efficient handling of schema evolution and partial updates, allowing the system to read and merge data from files with different field compositions while maintaining consistency and data integrity.

Usage

Use FieldBunch implementations when working with data evolution scenarios where files may contain different subsets of fields. DataBunch is used for regular data files, while BlobBunch is used for blob files that store partial field updates with specific ordering and validation requirements.
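Because both implementations expose the same two methods, calling code can treat them uniformly. The sketch below uses hypothetical stand-ins (BunchSketch, SingleFileBunch, MultiFileBunch, files_to_merge) rather than the pypaimon classes, and represents files as plain names. The check that every bunch reports the same row count is an assumption about how a merging reader might use the interface, not something this page confirms.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class BunchSketch(ABC):
    """Hypothetical stand-in for the FieldBunch interface."""

    @abstractmethod
    def row_count(self) -> int:
        ...

    @abstractmethod
    def files(self) -> List[str]:
        ...


class SingleFileBunch(BunchSketch):
    """Plays the role of DataBunch: exactly one file."""

    def __init__(self, file_name: str, rows: int):
        self._file_name = file_name
        self._rows = rows

    def row_count(self) -> int:
        return self._rows

    def files(self) -> List[str]:
        return [self._file_name]


class MultiFileBunch(BunchSketch):
    """Plays the role of BlobBunch: several files covering one row range."""

    def __init__(self, parts: List[Tuple[str, int]]):
        self._parts = parts

    def row_count(self) -> int:
        return sum(rows for _, rows in self._parts)

    def files(self) -> List[str]:
        return [name for name, _ in self._parts]


def files_to_merge(bunches: List[BunchSketch]) -> List[List[str]]:
    """Return one file list per bunch after checking all bunches cover the same rows."""
    if not bunches:
        return []
    expected = bunches[0].row_count()
    for bunch in bunches:
        if bunch.row_count() != expected:
            raise ValueError("all bunches should report the same row count")
    return [bunch.files() for bunch in bunches]


groups = files_to_merge([
    SingleFileBunch("data-0.parquet", 1000),
    MultiFileBunch([("blob-0.blob", 500), ("blob-1.blob", 500)]),
])
print(groups)
```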

Code Reference

Source Location

Repository: Apache_Paimon
File: paimon-python/pypaimon/read/reader/field_bunch.py

Signature

from abc import ABC
from typing import List

# DataFileMeta is pypaimon's data file metadata class.


class FieldBunch(ABC):
    """Interface for files organized by field."""

    def row_count(self) -> int:
        """Return the total row count for this bunch."""
        ...

    def files(self) -> List[DataFileMeta]:
        """Return the list of files in this bunch."""
        ...


class DataBunch(FieldBunch):
    """Files for a single data file."""

    def __init__(self, data_file: DataFileMeta):
        self.data_file = data_file

    def row_count(self) -> int:
        ...

    def files(self) -> List[DataFileMeta]:
        ...


class BlobBunch(FieldBunch):
    """Files for partial field (blob files)."""

    def __init__(self, expected_row_count: int):
        self._files: List[DataFileMeta] = []
        self.expected_row_count = expected_row_count
        self.latest_first_row_id = -1
        self.expected_next_first_row_id = -1
        self.latest_max_sequence_number = -1
        self._row_count = 0

    def add(self, file: DataFileMeta) -> None:
        """Add a blob file to this bunch."""
        ...

    def row_count(self) -> int:
        ...

    def files(self) -> List[DataFileMeta]:
        ...

Import

from pypaimon.read.reader.field_bunch import FieldBunch, DataBunch, BlobBunch

I/O Contract

Inputs

Name	Type	Required	Description
data_file	DataFileMeta	Yes (DataBunch)	Single data file for complete records
expected_row_count	int	Yes (BlobBunch)	Expected total row count for blob files
file	DataFileMeta	Yes (add method)	Blob file to add to the bunch

Outputs

Name	Type	Description
row_count	int	Total row count across all files in the bunch
files	List[DataFileMeta]	List of data file metadata objects in the bunch

Usage Examples

from pypaimon.read.reader.field_bunch import DataBunch, BlobBunch

# DataBunch for regular data files
# get_data_file_meta() is a placeholder for however a DataFileMeta is obtained
data_file = get_data_file_meta()
data_bunch = DataBunch(data_file)
print(f"Row count: {data_bunch.row_count()}")
print(f"Files: {data_bunch.files()}")

# BlobBunch for partial field files
blob_bunch = BlobBunch(expected_row_count=1000)

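# Add blob files in order of first_row_id
# get_blob_file_meta() is a placeholder that returns a DataFileMeta for a blob file
blob_file1 = get_blob_file_meta(first_row_id=0, row_count=500)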
blob_bunch.add(blob_file1)

blob_file2 = get_blob_file_meta(first_row_id=500, row_count=500)
blob_bunch.add(blob_file2)

print(f"Total rows: {blob_bunch.row_count()}")
print(f"Files in bunch: {len(blob_bunch.files())}")

# Validation errors will be raised for:
# - Non-blob files
# - Files with gaps in first_row_id
# - Files exceeding expected_row_count
# - Files with different schema_id or write_cols

Related Pages

Principle:Apache_Paimon_Split_Based_Reading
