Implementation:Apache Paimon GenericRow

From Leeroopedia
Revision as of 14:21, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Apache_Paimon_GenericRow.md)


Knowledge Sources
Domains: Data Structures, Serialization
Last Updated: 2026-02-08 00:00 GMT

Overview

GenericRow is the primary in-memory row representation. The same module provides the binary serialization and deserialization logic for Paimon's internal row format, which allows efficient data exchange between the Python and Java runtimes.

Description

The module implements three core classes.

`GenericRow` extends `InternalRow` as a simple dataclass with three attributes: `values` (a list of Python objects), `fields` (a list of `DataField` definitions), and `row_kind` (INSERT, UPDATE_BEFORE, UPDATE_AFTER, or DELETE). It provides `get_field()` for indexed access, `to_dict()` for conversion to a Python dictionary, and standard Python container methods.

`GenericRowDeserializer` parses Paimon's binary row format. It reads a 4-byte big-endian arity prefix, computes the width of the null bitset (an 8-bit header plus one bit per field, rounded up to a multiple of 8 bytes), checks the null bit for each field position, and dispatches to `parse_field_value()` based on the type name to extract each value from the fixed part (8 bytes per field) or the variable-length part. Strings and binary data use one of two encodings. Compact encoding stores values of up to 7 bytes inline in the fixed part, with the high bit of byte 7 set to mark compact mode. Larger values store an offset and a length in the fixed part that point into the variable-length section. All numeric types are little-endian.

`GenericRowSerializer` performs the reverse operation. It builds a fixed-part bytearray containing the null bitset and the 8-byte field slots, accumulates variable-length data in a separate buffer, writes compact strings and binaries inline or emits offset+length pointers for larger ones, serializes scalar types in place, and concatenates the result behind a 4-byte arity prefix.

Several types receive special handling: DECIMAL (stored as an unscaled long, with the scale taken from the type definition), TIMESTAMP (milliseconds since epoch), DATE (days since epoch), TIME (milliseconds since midnight), and BLOB (via the BlobData wrapper).
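
The width and offset arithmetic in the deserializer description can be sketched in a few lines. This is a minimal illustration, not the pypaimon source: the helper names `bitset_width_in_bytes`, `field_offset`, and `fixed_part_size` are hypothetical, and the rounding rule (header bits plus one bit per field, rounded up to whole 8-byte words) is assumed to mirror the layout used by Paimon's Java binary row.

```python
HEADER_SIZE_IN_BITS = 8   # constant shown in the class signatures
SLOT_SIZE = 8             # each field owns one 8-byte slot in the fixed part


def bitset_width_in_bytes(arity: int) -> int:
    """Bytes taken by the header plus null bits, rounded up to 8-byte words."""
    return ((arity + 63 + HEADER_SIZE_IN_BITS) // 64) * 8


def field_offset(arity: int, pos: int) -> int:
    """Offset of field `pos` within the row body (after the arity prefix)."""
    return bitset_width_in_bytes(arity) + pos * SLOT_SIZE


def fixed_part_size(arity: int) -> int:
    """Size of the null bitset plus all field slots."""
    return bitset_width_in_bytes(arity) + arity * SLOT_SIZE


for n in (4, 56, 57):
    print(n, bitset_width_in_bytes(n), fixed_part_size(n))
```

Under these assumptions a four-field row has an 8-byte bitset, so its first slot starts at offset 8. The bitset stays at 8 bytes up to 56 fields and grows to 16 bytes with the 57th.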

The rest of the SDK depends on this module. `GenericRow` is used to construct and manipulate rows in memory, while the serializer and deserializer read and write Paimon's binary row format, which is compatible with the Java implementation and is used in manifest files, statistics, and index keys.

Usage

GenericRow is used throughout the SDK for representing rows in memory. The serializer/deserializer are used internally when reading/writing manifests, statistics, and bucket hash keys.

Code Reference

Source Location

Signature

@dataclass
class GenericRow(InternalRow):
    values: List[Any]
    fields: List[DataField]
    row_kind: RowKind = RowKind.INSERT

    def get_field(self, pos: int) -> Any: ...
    def get_row_kind(self) -> RowKind: ...
    def to_dict(self) -> dict: ...
    def __len__(self) -> int: ...

class GenericRowDeserializer:
    HEADER_SIZE_IN_BITS = 8
    MAX_FIX_PART_DATA_SIZE = 7

    @classmethod
    def from_bytes(cls, bytes_data: bytes,
                   data_fields: List[DataField]) -> GenericRow: ...
    @classmethod
    def parse_field_value(cls, bytes_data: bytes, base_offset: int,
                          null_bits_size_in_bytes: int,
                          pos: int, data_type: DataType) -> Any: ...

class GenericRowSerializer:
    HEADER_SIZE_IN_BITS = 8
    MAX_FIX_PART_DATA_SIZE = 7

    @classmethod
    def to_bytes(cls, row: Union[GenericRow, BinaryRow]) -> bytes: ...
    @classmethod
    def _serialize_field_value(cls, value: Any, data_type: AtomicType) -> bytes: ...
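
The constant `MAX_FIX_PART_DATA_SIZE = 7` marks the boundary between the two string/binary encodings. The sketch below shows one way an 8-byte slot could be written and read back. It is illustrative only: `encode_slot` and `decode_slot` are hypothetical helpers, and the exact bit packing (payload in bytes 0 to 6 with flag and length in byte 7 for compact values; offset in the high 32 bits and length in the low 32 bits of a little-endian long otherwise) is an assumption modeled on the Java binary row layout rather than taken from the pypaimon source.

```python
import struct

MAX_FIX_PART_DATA_SIZE = 7   # constant shown in the signatures
COMPACT_FLAG = 0x80          # high bit of byte 7 marks compact mode


def encode_slot(data: bytes, var_offset: int) -> bytes:
    """Return the 8-byte fixed-part slot for a string/binary value.

    `var_offset` is where the value would start in the row body if it
    has to be placed in the variable-length section.
    """
    if len(data) <= MAX_FIX_PART_DATA_SIZE:
        # Compact: payload in bytes 0..6, flag plus length in byte 7.
        return data.ljust(7, b"\x00") + bytes([COMPACT_FLAG | len(data)])
    # Pointer: offset in the high 32 bits, length in the low 32 bits.
    return struct.pack("<q", (var_offset << 32) | len(data))


def decode_slot(slot: bytes, row_body: bytes) -> bytes:
    """Recover the value from a slot, following the pointer if needed."""
    if slot[7] & COMPACT_FLAG:
        return slot[:slot[7] & 0x7F]
    packed = struct.unpack("<q", slot)[0]
    offset, length = packed >> 32, packed & 0xFFFFFFFF
    return row_body[offset:offset + length]


# A 40-byte fixed part followed by one variable-length value.
body = bytes(40) + b"alice@example.com"
short_slot = encode_slot(b"Alice", var_offset=0)
long_slot = encode_slot(b"alice@example.com", var_offset=40)
print(decode_slot(short_slot, body))
print(decode_slot(long_slot, body))
```

The five-byte name fits inline, while the 17-byte address is reached through its offset and length.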

Import

from pypaimon.table.row.generic_row import GenericRow, GenericRowDeserializer, GenericRowSerializer

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| values | List[Any] | yes | Row field values |
| fields | List[DataField] | yes | Field definitions |
| row_kind | RowKind | no | Row operation type (default: INSERT) |
| bytes_data | bytes | yes (for deserialization) | Binary row data |

Outputs

| Name | Type | Description |
|------|------|-------------|
| GenericRow | GenericRow | In-memory row object |
| bytes | bytes | Serialized binary row data |
| dict | dict | Row as Python dictionary |

Usage Examples

Create Row

from pypaimon.table.row.generic_row import GenericRow, RowKind
from pypaimon.schema.data_types import DataField, AtomicType
from datetime import date

# Define fields
fields = [
    DataField(0, "user_id", AtomicType("INT")),
    DataField(1, "name", AtomicType("STRING")),
    DataField(2, "email", AtomicType("STRING")),
    DataField(3, "created_at", AtomicType("DATE"))
]

# Create row
row = GenericRow(
    values=[123, "Alice", "alice@example.com", date(2024, 1, 15)],
    fields=fields,
    row_kind=RowKind.INSERT
)

# Access fields
print(row.get_field(0))  # 123
print(row.get_field(1))  # "Alice"
print(len(row))  # 4

# Convert to dict
row_dict = row.to_dict()
print(row_dict)
# {'user_id': 123, 'name': 'Alice', 'email': 'alice@example.com', ...}

Serialize Row

from pypaimon.table.row.generic_row import GenericRowSerializer

# Serialize the `row` built in the Create Row example to bytes
row_bytes = GenericRowSerializer.to_bytes(row)
print(f"Serialized size: {len(row_bytes)} bytes")

# Binary format compatible with Java Paimon
# Can be written to manifest files, index keys, etc.

Deserialize Row

from pypaimon.table.row.generic_row import GenericRowDeserializer

# Deserialize from bytes
deserialized_row = GenericRowDeserializer.from_bytes(row_bytes, fields)

print(deserialized_row.get_field(0))  # 123
print(deserialized_row.get_field(1))  # "Alice"
print(deserialized_row.row_kind)  # RowKind.INSERT

# Round-trip preserves all values
assert deserialized_row.values == row.values
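
To make the byte order concrete without depending on pypaimon, the following sketch builds a buffer by hand and reads it back with `struct`. It assumes a row with a single non-null INT field, and it assumes the INT occupies the first four bytes of its 8-byte slot in little-endian order; `read_single_int` is a hypothetical helper, not part of the library.

```python
import struct

# Hand-built buffer for a single non-null INT field holding 123.
arity_prefix = struct.pack(">i", 1)        # 4 bytes, big-endian
null_bitset = bytes(8)                     # header byte plus null bits, all zero
slot = struct.pack("<i", 123) + bytes(4)   # INT in the first 4 bytes of its slot
buffer = arity_prefix + null_bitset + slot


def read_single_int(data: bytes) -> tuple:
    """Return (arity, value) for a one-INT row laid out as above."""
    arity = struct.unpack(">i", data[:4])[0]
    body = data[4:]                        # offsets below are relative to this
    bitset_width = ((arity + 63 + 8) // 64) * 8
    value = struct.unpack("<i", body[bitset_width:bitset_width + 4])[0]
    return arity, value


print(read_single_int(buffer))
```

The prefix is the only big-endian part of the buffer. Everything after it is read as little-endian.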

Handle Nulls

# Create row with null values (reuses `fields` and the imports from the examples above)
row_with_nulls = GenericRow(
    values=[456, "Bob", None, None],
    fields=fields
)

# Serialize and deserialize
row_bytes = GenericRowSerializer.to_bytes(row_with_nulls)
deserialized = GenericRowDeserializer.from_bytes(row_bytes, fields)

print(deserialized.get_field(2))  # None
print(deserialized.get_field(3))  # None
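
How nulls could be tracked in the bitset is sketched below. The helpers `set_null` and `is_null` are hypothetical, and the convention that field `pos` maps to bit `pos + HEADER_SIZE_IN_BITS`, counted from the least significant bit of each byte, is an assumption based on the Java binary row layout.

```python
HEADER_SIZE_IN_BITS = 8   # the first byte of the bitset is reserved for the header


def set_null(bitset: bytearray, pos: int) -> None:
    """Mark field `pos` as null."""
    index = pos + HEADER_SIZE_IN_BITS
    bitset[index // 8] |= 1 << (index % 8)


def is_null(bitset: bytes, pos: int) -> bool:
    """Check whether field `pos` is marked null."""
    index = pos + HEADER_SIZE_IN_BITS
    return bool(bitset[index // 8] & (1 << (index % 8)))


bitset = bytearray(8)     # one 8-byte word is enough for up to 56 fields
for pos, value in enumerate([456, "Bob", None, None]):
    if value is None:
        set_null(bitset, pos)

print([is_null(bitset, pos) for pos in range(4)])
print(bitset.hex())
```

For the row above, fields 2 and 3 set bits 10 and 11, which both live in the second byte of the bitset.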

Update Operations

# Represent an update as a before/after pair (reuses `fields`, `RowKind`, and `date` from Create Row)
old_row = GenericRow(
    values=[123, "Alice", "old@example.com", date(2024, 1, 15)],
    fields=fields,
    row_kind=RowKind.UPDATE_BEFORE
)

new_row = GenericRow(
    values=[123, "Alice", "new@example.com", date(2024, 1, 15)],
    fields=fields,
    row_kind=RowKind.UPDATE_AFTER
)

# Serialize both sides of the update
old_bytes = GenericRowSerializer.to_bytes(old_row)
new_bytes = GenericRowSerializer.to_bytes(new_row)
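
The meaning of the four row kinds can be illustrated by folding a sequence of change events into a current state. This is a generic changelog sketch and does not use pypaimon: `Kind` is a hypothetical stand-in for `RowKind` (its string values are assumed short codes, not confirmed by this page), and `apply_changelog` is an illustrative helper.

```python
from enum import Enum
from typing import Dict, Iterable, Tuple


class Kind(Enum):
    """Hypothetical stand-in for RowKind; the values are assumed short codes."""
    INSERT = "+I"
    UPDATE_BEFORE = "-U"
    UPDATE_AFTER = "+U"
    DELETE = "-D"


def apply_changelog(changes: Iterable[Tuple[Kind, int, str]]) -> Dict[int, str]:
    """Fold (kind, key, value) events into the latest value per key."""
    state: Dict[int, str] = {}
    for kind, key, value in changes:
        if kind in (Kind.INSERT, Kind.UPDATE_AFTER):
            state[key] = value        # additions carry the new image
        else:
            state.pop(key, None)      # UPDATE_BEFORE and DELETE retract
    return state


events = [
    (Kind.INSERT, 123, "old@example.com"),
    (Kind.UPDATE_BEFORE, 123, "old@example.com"),
    (Kind.UPDATE_AFTER, 123, "new@example.com"),
]
print(apply_changelog(events))
```

The UPDATE_BEFORE event retracts the old image and the UPDATE_AFTER event supplies the new one, which is why an update is expressed as two rows.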

Complex Types

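from decimal import Decimal
from datetime import datetime, time

from pypaimon.schema.data_types import AtomicType, DataField
from pypaimon.table.row.generic_row import (
    GenericRow, GenericRowDeserializer, GenericRowSerializer)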

# Fields with various types
complex_fields = [
    DataField(0, "id", AtomicType("BIGINT")),
    DataField(1, "price", AtomicType("DECIMAL(10,2)")),
    DataField(2, "timestamp", AtomicType("TIMESTAMP")),
    DataField(3, "time", AtomicType("TIME")),
    DataField(4, "data", AtomicType("BYTES"))
]

# Create row with complex values
complex_row = GenericRow(
    values=[
        9876543210,
        Decimal("123.45"),
        datetime(2024, 1, 15, 10, 30, 0),
        time(14, 30, 0),
        b"binary data"
    ],
    fields=complex_fields
)

# Serialize and deserialize
row_bytes = GenericRowSerializer.to_bytes(complex_row)
deserialized = GenericRowDeserializer.from_bytes(row_bytes, complex_fields)

print(deserialized.get_field(1))  # Decimal('123.45')
print(deserialized.get_field(2))  # datetime(2024, 1, 15, 10, 30, 0)
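
The conversions listed under special handling (unscaled long for DECIMAL, days since epoch for DATE, milliseconds for TIME and TIMESTAMP) can be sketched with the standard library alone. The function names are hypothetical, and treating a naive `datetime` as UTC is an assumption; the actual serializer may handle time zones and precision differently.

```python
import struct
from datetime import date, datetime, time, timedelta
from decimal import Decimal

EPOCH_DATE = date(1970, 1, 1)
EPOCH_DATETIME = datetime(1970, 1, 1)


def decimal_to_unscaled(value: Decimal, scale: int) -> int:
    """DECIMAL(p, s): shift the point right by `scale` digits, then truncate."""
    return int(value.scaleb(scale))


def date_to_days(value: date) -> int:
    """DATE: whole days since 1970-01-01."""
    return (value - EPOCH_DATE).days


def time_to_millis(value: time) -> int:
    """TIME: milliseconds since midnight."""
    seconds = (value.hour * 60 + value.minute) * 60 + value.second
    return seconds * 1000 + value.microsecond // 1000


def timestamp_to_millis(value: datetime) -> int:
    """TIMESTAMP: milliseconds since epoch (naive datetime assumed to be UTC)."""
    return (value - EPOCH_DATETIME) // timedelta(milliseconds=1)


unscaled = decimal_to_unscaled(Decimal("123.45"), scale=2)
days = date_to_days(date(2024, 1, 15))
millis_of_day = time_to_millis(time(14, 30, 0))
epoch_millis = timestamp_to_millis(datetime(2024, 1, 15, 10, 30, 0))

# Each result fits in one 8-byte little-endian slot.
slots = [struct.pack("<q", n) for n in (unscaled, days, millis_of_day, epoch_millis)]
print(unscaled, days, millis_of_day, epoch_millis)
```

Reversing each function (for example `Decimal(unscaled).scaleb(-scale)`) gives the corresponding read path, again under the same assumptions.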
