Implementation:Apache Paimon GenericRow
| Knowledge Sources | |
|---|---|
| Domains | Data Structures, Serialization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
GenericRow is the primary in-memory row representation in the Python SDK. The same module contains the serializer and deserializer for Paimon's binary row format, which lets the Python and Java runtimes exchange row data efficiently.
Description
The module implements three core classes.

`GenericRow` extends `InternalRow` as a simple dataclass with a `values` list (Python objects), a `fields` list (`DataField` definitions), and a `row_kind` (INSERT, UPDATE_BEFORE, UPDATE_AFTER, or DELETE). It provides `get_field()` for indexed access, `to_dict()` for conversion to a Python dictionary, and standard Python container methods such as `__len__()`.

`GenericRowDeserializer` parses Paimon's binary row format. It reads the arity prefix (4 bytes, big-endian), computes the null bitset width (the 8 header bits plus one bit per field, rounded up to a multiple of 8 bytes), checks the null bit for each field position, and calls `parse_field_value()`, which dispatches on the type name to extract the value from the fixed part (8 bytes per field) or the variable-length part. Strings and binary data use two encodings: compact encoding stores values of up to 7 bytes inline in the fixed part (the high bit of byte 7 is set to mark compact mode), while larger values store an offset and length in the fixed part that point into the variable-length part. All numeric field values are little-endian.

`GenericRowSerializer` performs the reverse operation. It builds a fixed-part bytearray holding the null bitset and the 8-byte field slots, accumulates variable-length data in a separate buffer, writes strings and binaries either inline (compact) or as offset+length pointers, serializes scalar types in place, and concatenates the result behind the 4-byte arity prefix. Special handling covers DECIMAL (stored as an unscaled long, with the scale taken from the type definition), TIMESTAMP (milliseconds since epoch), DATE (days since epoch), TIME (milliseconds since midnight), and BLOB (via the `BlobData` wrapper).
The rest of the SDK depends on this module: GenericRow is used to construct and manipulate rows in memory, while the serializer and deserializer read and write Paimon's binary row format, which is compatible with the Java implementation and is used in manifest files, statistics, and index keys.
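The two string/binary encodings described above can be sketched in standalone Python. This is a minimal illustration, not the pypaimon code: the helper names `encode_slot` and `decode_slot` are hypothetical, and two layout details that the description does not spell out are assumed here, namely that compact mode keeps the length in the low 7 bits of byte 7, and that pointer mode packs the offset into the high 32 bits and the length into the low 32 bits of a little-endian long.

```python
import struct
from typing import Tuple

MAX_FIX_PART_DATA_SIZE = 7  # same constant name as in the class signatures


def encode_slot(data: bytes, var_offset: int) -> Tuple[bytes, bytes]:
    """Return (8-byte fixed slot, bytes to append to the variable part)."""
    if len(data) <= MAX_FIX_PART_DATA_SIZE:
        slot = bytearray(8)
        slot[:len(data)] = data
        # High bit of byte 7 marks compact mode; low 7 bits hold the length (assumed).
        slot[7] = 0x80 | len(data)
        return bytes(slot), b""
    # Pointer mode: offset in the high 32 bits, length in the low 32 bits (assumed).
    return struct.pack("<q", (var_offset << 32) | len(data)), data


def decode_slot(slot: bytes, row: bytes) -> bytes:
    """Read a value back from its slot; `row` is the buffer the offset refers to."""
    if slot[7] & 0x80:
        return slot[:slot[7] & 0x7F]
    packed = struct.unpack("<q", slot)[0]
    offset, length = packed >> 32, packed & 0xFFFFFFFF
    return row[offset:offset + length]


short_slot, short_var = encode_slot(b"Alice", var_offset=40)
long_slot, long_var = encode_slot(b"alice@example.com", var_offset=40)
row = bytes(40) + long_var  # 40 placeholder bytes standing in for the fixed part
print(decode_slot(short_slot, row))  # b'Alice'
print(decode_slot(long_slot, row))   # b'alice@example.com'
```

The short value never touches the variable part, which is why compact mode saves both space and a pointer lookup for values such as short names or codes.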
Usage
GenericRow is used throughout the SDK for representing rows in memory. The serializer/deserializer are used internally when reading/writing manifests, statistics, and bucket hash keys.
Code Reference
Source Location
- Repository: Apache_Paimon
- File: paimon-python/pypaimon/table/row/generic_row.py
Signature
@dataclass
class GenericRow(InternalRow):
    values: List[Any]
    fields: List[DataField]
    row_kind: RowKind = RowKind.INSERT

    def get_field(self, pos: int) -> Any: ...
    def get_row_kind(self) -> RowKind: ...
    def to_dict(self) -> dict: ...
    def __len__(self) -> int: ...

class GenericRowDeserializer:
    HEADER_SIZE_IN_BITS = 8
    MAX_FIX_PART_DATA_SIZE = 7

    @classmethod
    def from_bytes(cls, bytes_data: bytes,
                   data_fields: List[DataField]) -> GenericRow: ...

    @classmethod
    def parse_field_value(cls, bytes_data: bytes, base_offset: int,
                          null_bits_size_in_bytes: int,
                          pos: int, data_type: DataType) -> Any: ...

class GenericRowSerializer:
    HEADER_SIZE_IN_BITS = 8
    MAX_FIX_PART_DATA_SIZE = 7

    @classmethod
    def to_bytes(cls, row: Union[GenericRow, BinaryRow]) -> bytes: ...

    @classmethod
    def _serialize_field_value(cls, value: Any, data_type: AtomicType) -> bytes: ...
Import
from pypaimon.table.row.generic_row import GenericRow, GenericRowDeserializer, GenericRowSerializer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| values | List[Any] | yes | Row field values |
| fields | List[DataField] | yes | Field definitions |
| row_kind | RowKind | no | Row operation type (default: INSERT) |
| bytes_data | bytes | yes (for deserialization) | Binary row data |
Outputs
| Name | Type | Description |
|---|---|---|
| GenericRow | GenericRow | In-memory row object |
| bytes | bytes | Serialized binary row data |
| dict | dict | Row as Python dictionary |
Usage Examples
Create Row
from pypaimon.table.row.generic_row import GenericRow, RowKind
from pypaimon.schema.data_types import DataField, AtomicType
from datetime import date
# Define fields
fields = [
    DataField(0, "user_id", AtomicType("INT")),
    DataField(1, "name", AtomicType("STRING")),
    DataField(2, "email", AtomicType("STRING")),
    DataField(3, "created_at", AtomicType("DATE"))
]
# Create row
row = GenericRow(
    values=[123, "Alice", "alice@example.com", date(2024, 1, 15)],
    fields=fields,
    row_kind=RowKind.INSERT
)
# Access fields
print(row.get_field(0)) # 123
print(row.get_field(1)) # "Alice"
print(len(row)) # 4
# Convert to dict
row_dict = row.to_dict()
print(row_dict)
# {'user_id': 123, 'name': 'Alice', 'email': 'alice@example.com', ...}
Serialize Row
from pypaimon.table.row.generic_row import GenericRowSerializer
# Serialize to bytes
row_bytes = GenericRowSerializer.to_bytes(row)
print(f"Serialized size: {len(row_bytes)} bytes")
# Binary format compatible with Java Paimon
# Can be written to manifest files, index keys, etc.
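To make the serialized layout more concrete, the following standalone sketch locates the sections of a row buffer without decoding any values. It does not call pypaimon: `describe_layout` and `ARITY_PREFIX_SIZE` are hypothetical names, the bitset-width formula is an assumption derived from the description (8 header bits plus one bit per field, rounded up to whole 8-byte words), and the sample buffer is built by hand rather than produced by `GenericRowSerializer`.

```python
import struct

HEADER_SIZE_IN_BITS = 8  # same constant name as in the class signatures
ARITY_PREFIX_SIZE = 4    # hypothetical name for the 4-byte arity prefix


def describe_layout(row_bytes: bytes) -> dict:
    """Locate the sections of a serialized row without decoding any values."""
    arity = struct.unpack_from(">i", row_bytes, 0)[0]  # big-endian prefix
    null_bits_size = ((arity + 63 + HEADER_SIZE_IN_BITS) // 64) * 8
    first_slot = ARITY_PREFIX_SIZE + null_bits_size
    return {
        "arity": arity,
        "null_bits_size": null_bits_size,
        "slot_offsets": [first_slot + 8 * pos for pos in range(arity)],
        "variable_part_offset": first_slot + 8 * arity,
    }


# Hand-built stand-in for a 4-field row: prefix, zeroed fixed part, one variable value.
sample = struct.pack(">i", 4) + bytes(8 + 4 * 8) + b"alice@example.com"
print(describe_layout(sample))
# {'arity': 4, 'null_bits_size': 8, 'slot_offsets': [12, 20, 28, 36], 'variable_part_offset': 44}
```

If the real layout matches these assumptions, passing the `row_bytes` produced above instead of `sample` should report the same offsets for a four-field row.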
Deserialize Row
from pypaimon.table.row.generic_row import GenericRowDeserializer
# Deserialize from bytes
deserialized_row = GenericRowDeserializer.from_bytes(row_bytes, fields)
print(deserialized_row.get_field(0)) # 123
print(deserialized_row.get_field(1)) # "Alice"
print(deserialized_row.row_kind) # RowKind.INSERT
# Round-trip preserves all values
assert deserialized_row.values == row.values
Handle Nulls
# Create row with null values
row_with_nulls = GenericRow(
    values=[456, "Bob", None, None],
    fields=fields
)
# Serialize and deserialize
row_bytes = GenericRowSerializer.to_bytes(row_with_nulls)
deserialized = GenericRowDeserializer.from_bytes(row_bytes, fields)
print(deserialized.get_field(2)) # None
print(deserialized.get_field(3)) # None
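How null fields are tracked can be illustrated with a small standalone sketch of the null bitset. The helper names are hypothetical and the code does not use pypaimon. It assumes that field bits follow the 8 header bits and are numbered from the least significant bit of each byte, which the description does not state explicitly.

```python
HEADER_SIZE_IN_BITS = 8  # same constant name as in the class signatures


def null_bitset_width_in_bytes(arity: int) -> int:
    """Round the header bits plus one bit per field up to whole 8-byte words."""
    return ((arity + 63 + HEADER_SIZE_IN_BITS) // 64) * 8


def set_null(fixed_part: bytearray, pos: int) -> None:
    """Mark field `pos` as null (bit order within a byte is an assumption)."""
    index = pos + HEADER_SIZE_IN_BITS
    fixed_part[index // 8] |= 1 << (index % 8)


def is_null(fixed_part: bytes, pos: int) -> bool:
    """Check the null bit for field `pos`."""
    index = pos + HEADER_SIZE_IN_BITS
    return bool(fixed_part[index // 8] & (1 << (index % 8)))


arity = 4
fixed_part = bytearray(null_bitset_width_in_bytes(arity) + 8 * arity)
for pos in (2, 3):  # the two None values in the row above
    set_null(fixed_part, pos)
print([is_null(fixed_part, pos) for pos in range(arity)])
# [False, False, True, True]
```

A deserializer following this scheme can skip the 8-byte slot of a null field entirely and return `None`.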
Update Operations
# Create update operations
old_row = GenericRow(
    values=[123, "Alice", "old@example.com", date(2024, 1, 15)],
    fields=fields,
    row_kind=RowKind.UPDATE_BEFORE
)
new_row = GenericRow(
    values=[123, "Alice", "new@example.com", date(2024, 1, 15)],
    fields=fields,
    row_kind=RowKind.UPDATE_AFTER
)
# Serialize for changelog stream
old_bytes = GenericRowSerializer.to_bytes(old_row)
new_bytes = GenericRowSerializer.to_bytes(new_row)
Complex Types
from decimal import Decimal
from datetime import datetime, time
# Fields with various types
complex_fields = [
    DataField(0, "id", AtomicType("BIGINT")),
    DataField(1, "price", AtomicType("DECIMAL(10,2)")),
    DataField(2, "timestamp", AtomicType("TIMESTAMP")),
    DataField(3, "time", AtomicType("TIME")),
    DataField(4, "data", AtomicType("BYTES"))
]
# Create row with complex values
complex_row = GenericRow(
    values=[
        9876543210,
        Decimal("123.45"),
        datetime(2024, 1, 15, 10, 30, 0),
        time(14, 30, 0),
        b"binary data"
    ],
    fields=complex_fields
)
# Serialize and deserialize
row_bytes = GenericRowSerializer.to_bytes(complex_row)
deserialized = GenericRowDeserializer.from_bytes(row_bytes, complex_fields)
print(deserialized.get_field(1)) # Decimal('123.45')
print(deserialized.get_field(2)) # datetime(2024, 1, 15, 10, 30, 0)
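The scalar conversions listed in the description (unscaled long for DECIMAL, days since epoch for DATE, milliseconds for TIME and TIMESTAMP) can be sketched with the standard library alone. The function names below are hypothetical, not pypaimon APIs, and the sketch treats naive `datetime` values as UTC, which is an assumption rather than documented behavior.

```python
import struct
from datetime import date, datetime, time, timedelta
from decimal import Decimal

EPOCH_DATE = date(1970, 1, 1)
EPOCH_DATETIME = datetime(1970, 1, 1)


def decimal_to_unscaled(value: Decimal, scale: int) -> int:
    """Shift the decimal point right by `scale` digits and keep the integer."""
    return int(value.scaleb(scale).to_integral_value())


def unscaled_to_decimal(unscaled: int, scale: int) -> Decimal:
    """Rebuild the Decimal using the scale from the type definition."""
    return Decimal(unscaled).scaleb(-scale)


def date_to_days(value: date) -> int:
    return (value - EPOCH_DATE).days


def time_to_millis(value: time) -> int:
    seconds = (value.hour * 60 + value.minute) * 60 + value.second
    return seconds * 1000 + value.microsecond // 1000


def timestamp_to_millis(value: datetime) -> int:
    # Naive datetimes are treated as UTC here (an assumption of this sketch).
    return (value - EPOCH_DATETIME) // timedelta(milliseconds=1)


unscaled = decimal_to_unscaled(Decimal("123.45"), scale=2)
slot = struct.pack("<q", unscaled)  # little-endian 8-byte value
print(unscaled)                                              # 12345
print(unscaled_to_decimal(struct.unpack("<q", slot)[0], 2))  # 123.45
print(time_to_millis(time(14, 30, 0)))                       # 52200000
```

Because the scale lives in the type definition rather than in the row bytes, the same `DataField` list must be supplied when deserializing, as in the example above.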