Principle:Apache Paimon Row Representation

Knowledge Sources	Apache_Paimon
Domains	Data_Model, Memory
Last Updated	2026-02-08 00:00 GMT

Overview

Row representation defines the in-memory data structures used to store and manipulate tabular data. It supports multiple encodings, each trading off memory density, serialization cost, and field access speed differently.

Description

The row representation principle addresses the fundamental question of how to efficiently store and manipulate structured data in memory during query processing. Different stages of data processing have different performance requirements: some operations prioritize memory density and serialization speed, while others need fast random field access or mutability. A unified row abstraction allows the system to switch between representations transparently, choosing the most appropriate encoding for each context.

Binary row representations pack data into a contiguous byte array in which every field sits at a fixed, computable offset, so a row can be serialized or sent over the network by copying its backing bytes instead of encoding it field by field. The format stores null bits in a compact header, followed by fixed-width primitive fields, with variable-length data referenced through (offset, length) pairs. The cost is that reading a field means decoding it from the binary format, which makes this representation a good fit for I/O-bound operations and a weaker fit for CPU-intensive transformations.

Generic row representations use object-based storage where each field is a separate object reference. This approach provides fast field access and easy mutability, making it suitable for row construction, data transformation, and operations that frequently read or modify individual fields. The trade-off is higher memory overhead due to object headers and pointer indirection.

Special fields extend basic row representations with system-level metadata like sequence numbers, row kinds (insert/update/delete), and versioning information, enabling change data capture and update semantics.

Usage

Apply row representation abstractions when building query engines, data processing pipelines, or any system that needs to balance serialization performance, memory efficiency, and field access speed. Use binary representations for I/O operations and generic representations for data transformations.

Theoretical Basis

The row representation pattern provides multiple encoding strategies:

Abstract Row Interface

interface Row:
    function getFieldCount() -> integer
    function getField(index) -> value
    function isNullAt(index) -> boolean
    function getInt(index) -> integer
    function getString(index) -> string
    // ... type-specific accessors

Binary Row Layout

Binary Row Memory Layout:
+------------------+
| Null Bit Set     |  (1 bit per field, rounded to bytes)
+------------------+
| Fixed-Length     |  (primitives: int, long, double, etc.)
| Fields           |
+------------------+
| Variable-Length  |  (offset, length) pairs for strings, arrays
| Field Metadata   |
+------------------+
| Variable-Length  |  Actual bytes for variable-length data
| Field Data       |
+------------------+

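Example for row (id: int, name: string, age: int), with the fixed-width fields grouped ahead of the variable-length metadata:
Bytes 0-3:   Null bits (3 bits used, padded to 4 bytes in this example; 0x00 = no nulls)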
Bytes 4-7:   id value (fixed 4 bytes)
Bytes 8-11:  age value (fixed 4 bytes)
Bytes 12-15: name offset (points to byte 20)
Bytes 16-19: name length (e.g., 5)
Bytes 20-24: name data ("Alice")
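
The byte-level example above can be reproduced with a short Python sketch. It is an illustration of this simplified layout only, not Paimon's actual BinaryRow encoding. The function names pack_row and read_name, the little-endian byte order, and the rule that bit i of the header marks field i as null are all assumptions made for the example.

```python
import struct

FIXED_SIZE = 20  # 4 null bits + 4 id + 4 age + 4 name offset + 4 name length


def pack_row(row_id, name, age):
    """Pack (id: int, name: string, age: int) into the illustrative layout."""
    name_bytes = name.encode("utf-8") if name is not None else b""

    # Assumption: bit i is set when field i (schema order id, name, age) is null.
    null_bits = 0
    for i, value in enumerate((row_id, name, age)):
        if value is None:
            null_bits |= 1 << i

    fixed = struct.pack(
        "<IiiII",
        null_bits,        # bytes 0-3
        row_id or 0,      # bytes 4-7
        age or 0,         # bytes 8-11
        FIXED_SIZE,       # bytes 12-15: name data starts right after the fixed part
        len(name_bytes),  # bytes 16-19
    )
    return fixed + name_bytes


def read_name(buf):
    """Decode only the name field, without touching the other fields."""
    (null_bits,) = struct.unpack_from("<I", buf, 0)
    if null_bits & (1 << 1):
        return None
    offset, length = struct.unpack_from("<II", buf, 12)
    return buf[offset:offset + length].decode("utf-8")


buf = pack_row(1, "Alice", 30)
```

Because the whole row lives in one bytes object, "serializing" it amounts to writing buf as-is, while read_name shows the decoding step that every field access has to pay.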

Generic Row Construction

function createGenericRow(schema, values):
    row = new GenericRow(schema.fieldCount())

    for i in 0 to values.length - 1:
        if values[i] == null:
            row.setNullAt(i)
        else:
            row.setField(i, values[i])

    return row
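
As a rough Python counterpart to the pseudocode above, the sketch below stores each field as a separate object reference in a list. The class and method names mirror the pseudocode but are hypothetical; they are not Paimon's Java API. The length check is an added assumption so that a mismatched value list fails early.

```python
class GenericRow:
    """Object-based row: one Python reference per field, freely mutable."""

    def __init__(self, field_count):
        self._fields = [None] * field_count

    def get_field_count(self):
        return len(self._fields)

    def is_null_at(self, index):
        return self._fields[index] is None

    def get_field(self, index):
        return self._fields[index]

    def set_field(self, index, value):
        self._fields[index] = value

    def set_null_at(self, index):
        self._fields[index] = None


def create_generic_row(field_count, values):
    """Build a row from a list of values, treating None as SQL NULL."""
    if len(values) != field_count:
        raise ValueError("expected %d values, got %d" % (field_count, len(values)))
    row = GenericRow(field_count)
    for i, value in enumerate(values):
        if value is None:
            row.set_null_at(i)
        else:
            row.set_field(i, value)
    return row


row = create_generic_row(3, [1, None, 30])
row.set_field(1, "Alice")  # in-place update, no re-encoding needed
```

Field reads and writes are plain list operations here, which is the access-speed and mutability benefit described earlier; the price is one object reference per field.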

Key-Value Pair Representation

structure KeyValue:
    sequenceNumber: long      // Logical version used to order changes to the same key
    kind: enum                // +I (insert), -U (update-before), +U (update-after), -D (delete)
    key: Row                  // Primary key fields
    value: Row                // Non-key fields

function createKeyValue(key, value, seqNum, kind):
    return KeyValue(seqNum, kind, key, value)
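
The structure above might look as follows in Python. This is a sketch under assumptions: keys and values are modeled as plain tuples rather than Row objects, and latest_per_key is a hypothetical helper showing one way a sequence number could be used to pick the newest record for each key. It is not presented as Paimon's merge logic.

```python
from dataclasses import dataclass
from enum import Enum


class RowKind(Enum):
    INSERT = "+I"
    UPDATE_BEFORE = "-U"
    UPDATE_AFTER = "+U"
    DELETE = "-D"


@dataclass(frozen=True)
class KeyValue:
    sequence_number: int
    kind: RowKind
    key: tuple    # primary key fields (tuple stands in for a Row)
    value: tuple  # non-key fields (tuple stands in for a Row)


def latest_per_key(records):
    """Keep, for every key, the record with the highest sequence number."""
    latest = {}
    for kv in records:
        current = latest.get(kv.key)
        if current is None or kv.sequence_number > current.sequence_number:
            latest[kv.key] = kv
    return latest


records = [
    KeyValue(1, RowKind.INSERT, (1,), ("Alice",)),
    KeyValue(3, RowKind.UPDATE_AFTER, (1,), ("Alicia",)),
    KeyValue(2, RowKind.INSERT, (2,), ("Bob",)),
]
latest = latest_per_key(records)
```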

Special Fields Extension

Sequence number: Logical timestamp for versioning and ordering
Row kind: Change data capture semantics (insert/update/delete markers)
Partition: Partition identifier for data locality optimization
Bucket: Hash bucket assignment for distributed processing

Conversion Between Representations

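function convertToBinary(genericRow):
    binaryRow = new BinaryRow(genericRow.fieldCount)
    writer = new BinaryRowWriter(binaryRow)
    writer.reset()        // clear null bits and rewind the write cursor before writing

    for i in 0 to genericRow.fieldCount - 1:
        if genericRow.isNullAt(i):
            writer.setNullAt(i)
        else:
            writer.writeField(i, genericRow.getField(i))

    writer.complete()     // finalize so binaryRow points at the written bytes
    return binaryRow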

function convertToGeneric(binaryRow):
    genericRow = new GenericRow(binaryRow.fieldCount)

    for i in 0 to binaryRow.fieldCount - 1:
        if binaryRow.isNullAt(i):
            genericRow.setNullAt(i)
        else:
            genericRow.setField(i, binaryRow.getField(i))

    return genericRow
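
A round trip between the two representations can be sketched in Python as below. To keep the code short, this sketch assumes a simplified layout in which every field gets one 8-byte slot (an integer value, or an (offset, length) pair for strings) after a 4-byte null-bit header. That differs slightly from the grouped layout illustrated earlier and from any real Paimon encoding. The schema is a hypothetical list of type names, and a generic row is modeled as a plain list.

```python
import struct

HEADER = 4  # bytes reserved for null bits (assumption: at most 32 fields)
SLOT = 8    # one fixed-width slot per field


def convert_to_binary(schema, values):
    """Encode a list of values into a single contiguous bytes object."""
    fixed_size = HEADER + SLOT * len(schema)
    fixed = bytearray(fixed_size)
    var_part = bytearray()
    null_bits = 0
    for i, (ftype, value) in enumerate(zip(schema, values)):
        pos = HEADER + SLOT * i
        if value is None:
            null_bits |= 1 << i
        elif ftype == "int":
            struct.pack_into("<q", fixed, pos, value)
        elif ftype == "string":
            data = value.encode("utf-8")
            # Offset is measured from the start of the row.
            struct.pack_into("<II", fixed, pos, fixed_size + len(var_part), len(data))
            var_part += data
        else:
            raise ValueError("unsupported type: " + ftype)
    struct.pack_into("<I", fixed, 0, null_bits)
    return bytes(fixed + var_part)


def convert_to_generic(schema, buf):
    """Decode every field back into a list of Python objects."""
    (null_bits,) = struct.unpack_from("<I", buf, 0)
    values = []
    for i, ftype in enumerate(schema):
        pos = HEADER + SLOT * i
        if null_bits & (1 << i):
            values.append(None)
        elif ftype == "int":
            values.append(struct.unpack_from("<q", buf, pos)[0])
        elif ftype == "string":
            offset, length = struct.unpack_from("<II", buf, pos)
            values.append(buf[offset:offset + length].decode("utf-8"))
        else:
            raise ValueError("unsupported type: " + ftype)
    return values


schema = ["int", "string", "int"]
original = [1, "Alice", None]
encoded = convert_to_binary(schema, original)
decoded = convert_to_generic(schema, encoded)
```

The two functions are inverses for the supported types, so a pipeline can decode to the generic form for transformation and re-encode before writing or shuffling.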
