Implementation:Apache Paimon FieldBunch
| Knowledge Sources | |
|---|---|
| Domains | Data Organization, Schema Evolution |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
FieldBunch organizes data files into groups by field content to support data evolution and partial field updates.
Description
FieldBunch is an abstract interface with concrete implementations DataBunch and BlobBunch that organize DataFileMeta objects by field content. This abstraction supports data evolution scenarios where different files may contain different subsets of columns, particularly in tables with blob files that store partial field updates.
DataBunch represents files containing complete data records, while BlobBunch represents files containing partial field data. BlobBunch enforces constraints such as continuous row IDs, matching schema IDs, and consistent write columns across all files in the bunch. It validates that blob files are added in the correct order and that row counts do not exceed expectations.
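The checks described above can be illustrated with a minimal, self-contained sketch. It is not the pypaimon implementation: `FileMeta` and `SketchBlobBunch` are hypothetical stand-ins, the `.blob` suffix test is an assumed way of recognising blob files, the error type and messages are invented for illustration, and any sequence-number handling is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical stand-in for DataFileMeta with only the fields used here."""
    file_name: str
    first_row_id: int
    row_count: int
    schema_id: int
    write_cols: Tuple[str, ...]


class SketchBlobBunch:
    """Illustrative re-creation of the described checks, not the pypaimon class."""

    def __init__(self, expected_row_count: int):
        self.expected_row_count = expected_row_count
        self._files: List[FileMeta] = []
        self._row_count = 0
        self._next_first_row_id: Optional[int] = None

    def add(self, file: FileMeta) -> None:
        # Assumption: blob files are recognised by their file name suffix.
        if not file.file_name.endswith(".blob"):
            raise ValueError("only blob files can be added")
        if self._files:
            head = self._files[0]
            # Each file must start where the previous one ended.
            if file.first_row_id != self._next_first_row_id:
                raise ValueError("row IDs must be continuous")
            if file.schema_id != head.schema_id:
                raise ValueError("schema_id must match")
            if file.write_cols != head.write_cols:
                raise ValueError("write_cols must match")
        # The bunch may never hold more rows than the caller expects.
        if self._row_count + file.row_count > self.expected_row_count:
            raise ValueError("row count exceeds expected_row_count")
        self._files.append(file)
        self._row_count += file.row_count
        self._next_first_row_id = file.first_row_id + file.row_count

    def row_count(self) -> int:
        return self._row_count

    def files(self) -> List[FileMeta]:
        return list(self._files)
```

In this sketch every rejected file leaves the bunch unchanged, because all checks run before any state is updated.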
Grouping files this way lets the reader merge data from files with different field compositions, as schema evolution and partial updates require, while the validation in BlobBunch guards consistency and data integrity.
Usage
Use FieldBunch implementations when working with data evolution scenarios where files may contain different subsets of fields. DataBunch is used for regular data files, while BlobBunch is used for blob files that store partial field updates with specific ordering and validation requirements.
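One way to picture the split between the two bunch types is a routing step that separates regular data files from blob files before any bunch is built. The sketch below is an assumption-laden illustration, not code from pypaimon: `FileMeta` and `split_by_kind` are hypothetical names, the `.blob` suffix test is an assumed convention, and grouping blob files by their write columns is a guess at a reasonable key.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical stand-in for DataFileMeta."""
    file_name: str
    row_count: int
    write_cols: Tuple[str, ...]


def split_by_kind(
    files: List[FileMeta],
) -> Tuple[List[FileMeta], Dict[Tuple[str, ...], List[FileMeta]]]:
    """Separate regular data files from blob files grouped by write columns."""
    data_files: List[FileMeta] = []
    blob_groups: Dict[Tuple[str, ...], List[FileMeta]] = {}
    for f in files:
        if f.file_name.endswith(".blob"):  # assumed blob-file convention
            blob_groups.setdefault(f.write_cols, []).append(f)
        else:
            data_files.append(f)
    return data_files, blob_groups
```

Under these assumptions, each entry of `data_files` would be wrapped in its own DataBunch, and each blob group would be fed, in row-ID order, into one BlobBunch.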
Code Reference
Source Location
- Repository: Apache_Paimon
- File: paimon-python/pypaimon/read/reader/field_bunch.py
Signature
class FieldBunch(ABC):
    """Interface for files organized by field."""

    def row_count(self) -> int:
        """Return the total row count for this bunch."""
        ...

    def files(self) -> List[DataFileMeta]:
        """Return the list of files in this bunch."""
        ...


class DataBunch(FieldBunch):
    """Files for a single data file."""

    def __init__(self, data_file: DataFileMeta):
        self.data_file = data_file

    def row_count(self) -> int:
        ...

    def files(self) -> List[DataFileMeta]:
        ...


class BlobBunch(FieldBunch):
    """Files for partial field (blob files)."""

    def __init__(self, expected_row_count: int):
        self._files: List[DataFileMeta] = []
        self.expected_row_count = expected_row_count
        self.latest_first_row_id = -1
        self.expected_next_first_row_id = -1
        self.latest_max_sequence_number = -1
        self._row_count = 0

    def add(self, file: DataFileMeta) -> None:
        """Add a blob file to this bunch."""
        ...

    def row_count(self) -> int:
        ...

    def files(self) -> List[DataFileMeta]:
        ...
Import
from pypaimon.read.reader.field_bunch import FieldBunch, DataBunch, BlobBunch
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_file | DataFileMeta | Yes (DataBunch) | Single data file for complete records |
| expected_row_count | int | Yes (BlobBunch) | Expected total row count for blob files |
| file | DataFileMeta | Yes (add method) | Blob file to add to the bunch |
Outputs
| Name | Type | Description |
|---|---|---|
| row_count() | int | Total row count across all files in the bunch |
| files() | List[DataFileMeta] | List of data file metadata objects in the bunch |
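The contract above amounts to two accessors that every bunch exposes. The following sketch shows how a single-file bunch could satisfy it. The method bodies are assumptions, since the signature listing elides them, and `FileMeta`, `Bunch`, `SingleFileBunch`, and `describe` are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical stand-in for DataFileMeta."""
    file_name: str
    row_count: int


class Bunch(ABC):
    """Stand-in for the FieldBunch interface: two read-only accessors."""

    @abstractmethod
    def row_count(self) -> int:
        """Return the total row count for this bunch."""

    @abstractmethod
    def files(self) -> List[FileMeta]:
        """Return the list of files in this bunch."""


class SingleFileBunch(Bunch):
    """Assumed behaviour of a bunch that wraps exactly one data file."""

    def __init__(self, data_file: FileMeta):
        self.data_file = data_file

    def row_count(self) -> int:
        return self.data_file.row_count

    def files(self) -> List[FileMeta]:
        return [self.data_file]


def describe(bunch: Bunch) -> str:
    """Summarise any bunch using only the two contract methods."""
    names = ", ".join(f.file_name for f in bunch.files())
    return f"{bunch.row_count()} rows in [{names}]"
```

Code written against the two accessors, like `describe` here, does not need to know which kind of bunch it receives.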
Usage Examples
from pypaimon.read.reader.field_bunch import DataBunch, BlobBunch

# get_data_file_meta() and get_blob_file_meta() are placeholders for code
# that produces DataFileMeta objects; they are not part of pypaimon.

# DataBunch for regular data files
data_file = get_data_file_meta()
data_bunch = DataBunch(data_file)
print(f"Row count: {data_bunch.row_count()}")
print(f"Files: {data_bunch.files()}")

# BlobBunch for partial field files
blob_bunch = BlobBunch(expected_row_count=1000)

# Add blob files in order of first_row_id, with no gaps between them
blob_file1 = get_blob_file_meta(first_row_id=0, row_count=500)
blob_bunch.add(blob_file1)
blob_file2 = get_blob_file_meta(first_row_id=500, row_count=500)
blob_bunch.add(blob_file2)

print(f"Total rows: {blob_bunch.row_count()}")
print(f"Files in bunch: {len(blob_bunch.files())}")

# Validation errors will be raised for:
# - Non-blob files
# - Files with gaps in first_row_id
# - Files exceeding expected_row_count
# - Files with different schema_id or write_cols