Principle:Apache Paimon Row Representation
| Knowledge Sources | |
|---|---|
| Domains | Data_Model, Memory |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Row representation defines internal in-memory data structures for storing and manipulating tabular data with support for multiple encoding strategies optimized for different performance characteristics.
Description
The row representation principle addresses the fundamental question of how to efficiently store and manipulate structured data in memory during query processing. Different stages of data processing have different performance requirements: some operations prioritize memory density and serialization speed, while others need fast random field access or mutability. A unified row abstraction allows the system to switch between representations transparently, choosing the most appropriate encoding for each context.
Binary row representations pack data into contiguous byte arrays with fixed-width offsets, enabling zero-copy serialization and fast network transmission. This format stores null bits in a compact header, followed by fixed-width primitive fields, with variable-length data referenced through offsets. The entire row can be serialized by simply copying its backing byte array, avoiding field-by-field encoding overhead. However, accessing individual fields requires decoding from the binary format, making this representation ideal for I/O-bound operations but less suitable for CPU-intensive transformations.
Generic row representations use object-based storage where each field is a separate object reference. This approach provides fast field access and easy mutability, making it suitable for row construction, data transformation, and operations that frequently read or modify individual fields. The trade-off is higher memory overhead due to object headers and pointer indirection.
Special fields extend basic row representations with system-level metadata like sequence numbers, row kinds (insert/update/delete), and versioning information, enabling change data capture and update semantics.
Usage
Apply row representation abstractions when building query engines, data processing pipelines, or any system that needs to balance serialization performance, memory efficiency, and field access speed. Use binary representations for I/O operations and generic representations for data transformations.
Theoretical Basis
The row representation pattern provides multiple encoding strategies:
Abstract Row Interface
interface Row:
    function getFieldCount() -> integer
    function getField(index) -> value
    function isNullAt(index) -> boolean
    function getInt(index) -> integer
    function getString(index) -> string
    // ... type-specific accessors
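As a rough illustration, the interface above could be expressed in Python as an abstract base class. This is only a sketch: the snake_case method names and the `TupleRow` class are hypothetical stand-ins chosen for this example, not part of any Paimon API.

```python
from abc import ABC, abstractmethod
from typing import Any, Sequence


class Row(ABC):
    """Read-only row abstraction mirroring the pseudocode interface."""

    @abstractmethod
    def get_field_count(self) -> int: ...

    @abstractmethod
    def get_field(self, index: int) -> Any: ...

    def is_null_at(self, index: int) -> bool:
        return self.get_field(index) is None

    def get_int(self, index: int) -> int:
        value = self.get_field(index)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            raise TypeError(f"field {index} is not an int: {value!r}")
        return value

    def get_string(self, index: int) -> str:
        value = self.get_field(index)
        if not isinstance(value, str):
            raise TypeError(f"field {index} is not a string: {value!r}")
        return value


class TupleRow(Row):
    """Hypothetical minimal implementation backed by an immutable tuple."""

    def __init__(self, values: Sequence[Any]):
        self._values = tuple(values)

    def get_field_count(self) -> int:
        return len(self._values)

    def get_field(self, index: int) -> Any:
        return self._values[index]


row = TupleRow([1, "Alice", None])
print(row.get_field_count(), row.get_int(0), row.get_string(1), row.is_null_at(2))
```

Because callers depend only on `Row`, a binary-backed implementation could be substituted for `TupleRow` without changing the calling code.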
Binary Row Layout
Binary Row Memory Layout:
+------------------+
| Null Bit Set | (1 bit per field, rounded to bytes)
+------------------+
| Fixed-Length | (primitives: int, long, double, etc.)
| Fields |
+------------------+
| Variable-Length | (offset, length) pairs for strings, arrays
| Field Metadata |
+------------------+
| Variable-Length | Actual bytes for variable-length data
| Field Data |
+------------------+
Example for row (id: int, name: string, age: int), with the fixed-length fields placed before the variable-length metadata as in the layout above:
    Bytes 0-3:   Null bits (all zero = no nulls; padded to 4 bytes in this example)
    Bytes 4-7:   id value (fixed 4 bytes)
    Bytes 8-11:  age value (fixed 4 bytes)
    Bytes 12-15: name offset (points to byte 20)
    Bytes 16-19: name length (e.g., 5)
    Bytes 20-24: name data ("Alice")
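The byte positions in this example can be reproduced with Python's `struct` module. The sketch below assumes little-endian encoding and one null bit per field in schema order (id = bit 0, name = bit 1, age = bit 2); neither detail is stated in the example, and the code illustrates the conceptual layout only, not Paimon's actual binary format.

```python
import struct

# Assumed layout: null bits (uint32), id (int32), age (int32),
# name offset (uint32), name length (uint32). Little-endian is an assumption.
HEADER = struct.Struct("<IiiII")  # 20 bytes, so name data starts at byte 20


def encode_row(id_, name, age):
    null_bits = 0
    if id_ is None:
        null_bits |= 1 << 0
    if name is None:
        null_bits |= 1 << 1
    if age is None:
        null_bits |= 1 << 2
    name_bytes = b"" if name is None else name.encode("utf-8")
    header = HEADER.pack(null_bits, id_ or 0, age or 0, HEADER.size, len(name_bytes))
    return header + name_bytes


def decode_row(buf):
    null_bits, id_, age, offset, length = HEADER.unpack_from(buf, 0)
    name = None if null_bits & 0b010 else buf[offset:offset + length].decode("utf-8")
    return (
        None if null_bits & 0b001 else id_,
        name,
        None if null_bits & 0b100 else age,
    )


buf = encode_row(1, "Alice", 30)
print(len(buf), decode_row(buf))
```

Serializing the row is just handing `buf` to the output stream, which is the property the binary representation is designed for; reading `name` costs an offset lookup plus a UTF-8 decode.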
Generic Row Construction
function createGenericRow(schema, values):
    row = new GenericRow(schema.fieldCount())
    for i in 0 to values.length - 1:
        if values[i] == null:
            row.setNullAt(i)
        else:
            row.setField(i, values[i])
    return row
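A runnable Python counterpart to this pseudocode might look as follows. It is a minimal sketch: the `GenericRow` class here is a hypothetical list-backed stand-in, and the schema is reduced to a field count since the pseudocode uses nothing else from it.

```python
from typing import Any, List, Optional, Sequence


class GenericRow:
    """Hypothetical object-based row: one Python reference per field."""

    def __init__(self, field_count: int):
        self._fields: List[Optional[Any]] = [None] * field_count

    def get_field_count(self) -> int:
        return len(self._fields)

    def get_field(self, index: int) -> Any:
        return self._fields[index]

    def is_null_at(self, index: int) -> bool:
        return self._fields[index] is None

    def set_field(self, index: int, value: Any) -> None:
        self._fields[index] = value

    def set_null_at(self, index: int) -> None:
        self._fields[index] = None


def create_generic_row(field_count: int, values: Sequence[Any]) -> GenericRow:
    # Guard against a value list that does not match the schema width.
    if len(values) != field_count:
        raise ValueError(f"expected {field_count} values, got {len(values)}")
    row = GenericRow(field_count)
    for i, value in enumerate(values):
        if value is None:
            row.set_null_at(i)
        else:
            row.set_field(i, value)
    return row


row = create_generic_row(3, [1, "Alice", None])
row.set_field(2, 30)  # in-place mutation, with no re-encoding step
print(row.get_field(1), row.get_field(2))
```

The in-place `set_field` call is the mutability that makes this representation convenient for transformations, at the cost of one object reference per field.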
Key-Value Pair Representation
structure KeyValue:
    sequenceNumber: long   // Transaction version
    kind: enum             // +I (insert), -U (update-before), +U (update-after), -D (delete)
    key: Row               // Primary key fields
    value: Row             // Non-key fields

function createKeyValue(key, value, seqNum, kind):
    return KeyValue(seqNum, kind, key, value)
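The structure above maps naturally onto a Python dataclass. In the sketch below, plain tuples stand in for `Row`, and `latest_per_key` is a hypothetical helper showing one way a sequence number can order versions of the same key; it is not presented as the merge logic Paimon itself uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Tuple


class RowKind(Enum):
    INSERT = "+I"
    UPDATE_BEFORE = "-U"
    UPDATE_AFTER = "+U"
    DELETE = "-D"


@dataclass(frozen=True)
class KeyValue:
    sequence_number: int  # logical version used for ordering
    kind: RowKind
    key: Tuple            # primary key fields (tuple stands in for Row)
    value: Tuple          # non-key fields


def latest_per_key(records: Iterable[KeyValue]) -> Dict[Tuple, KeyValue]:
    """Keep, for each key, the record with the highest sequence number."""
    latest: Dict[Tuple, KeyValue] = {}
    for kv in records:
        current = latest.get(kv.key)
        if current is None or kv.sequence_number > current.sequence_number:
            latest[kv.key] = kv
    return latest


records = [
    KeyValue(1, RowKind.INSERT, (1,), ("Alice",)),
    KeyValue(3, RowKind.DELETE, (2,), ("Bob",)),
    KeyValue(2, RowKind.UPDATE_AFTER, (1,), ("Alicia",)),
    KeyValue(1, RowKind.INSERT, (2,), ("Bob",)),
]
latest = latest_per_key(records)
for key, kv in sorted(latest.items()):
    print(key, kv.kind.value, kv.value)
```

Here arrival order does not matter: key `(2,)` resolves to the delete with sequence number 3 even though the insert with sequence number 1 arrives later in the list.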
Special Fields Extension
- Sequence number: Logical timestamp for versioning and ordering
- Row kind: Change data capture semantics (insert/update/delete markers)
- Partition: Partition identifier for data locality optimization
- Bucket: Hash bucket assignment for distributed processing
Conversion Between Representations
function convertToBinary(genericRow):
    binaryRow = new BinaryRow(genericRow.getFieldCount())
    writer = new BinaryRowWriter(binaryRow)
    for i in 0 to genericRow.getFieldCount() - 1:
        if genericRow.isNullAt(i):
            writer.setNullAt(i)
        else:
            writer.writeField(i, genericRow.getField(i))
    return binaryRow

function convertToGeneric(binaryRow):
    genericRow = new GenericRow(binaryRow.getFieldCount())
    for i in 0 to binaryRow.getFieldCount() - 1:
        if binaryRow.isNullAt(i):
            genericRow.setNullAt(i)
        else:
            genericRow.setField(i, binaryRow.getField(i))
    return genericRow
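The two conversions can be exercised as a round trip. The sketch below is schema-driven because a binary encoder has to know each field's type, a detail the pseudocode hides inside `writeField`. It makes several assumptions: the schema is a list of the strings `"int"` and `"string"`, plain Python lists play the role of generic rows, encoding is little-endian, and the layout follows the conceptual diagram under Binary Row Layout rather than Paimon's real format.

```python
import struct
from typing import Any, List, Sequence


def to_binary(schema: Sequence[str], values: Sequence[Any]) -> bytes:
    """Encode as: null bits | int fields | (offset, length) pairs | string bytes."""
    null_bits = 0
    for i, value in enumerate(values):
        if value is None:
            null_bits |= 1 << i
    ints = [v or 0 for t, v in zip(schema, values) if t == "int"]
    strs = [(v or "").encode("utf-8") for t, v in zip(schema, values) if t == "string"]

    out = bytearray(struct.pack("<I", null_bits))
    for n in ints:
        out += struct.pack("<i", n)
    offset = 4 + 4 * len(ints) + 8 * len(strs)  # first byte after all metadata
    for s in strs:
        out += struct.pack("<II", offset, len(s))
        offset += len(s)
    for s in strs:
        out += s
    return bytes(out)


def to_generic(schema: Sequence[str], buf: bytes) -> List[Any]:
    (null_bits,) = struct.unpack_from("<I", buf, 0)
    int_pos = 4
    str_pos = 4 + 4 * list(schema).count("int")
    values: List[Any] = []
    for i, field_type in enumerate(schema):
        if field_type == "int":
            (value,) = struct.unpack_from("<i", buf, int_pos)
            int_pos += 4
        else:
            offset, length = struct.unpack_from("<II", buf, str_pos)
            str_pos += 8
            value = buf[offset:offset + length].decode("utf-8")
        # A set null bit overrides whatever placeholder was stored.
        values.append(None if (null_bits >> i) & 1 else value)
    return values


schema = ["int", "string", "int"]
buf = to_binary(schema, [1, "Alice", 30])
print(len(buf), to_generic(schema, buf))
```

Null fields still occupy their fixed-width slot with a placeholder value, so every field stays at a position that can be computed from the schema alone.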