Principle:Apache Paimon BTree Global Index
| Knowledge Sources | |
|---|---|
| Domains | Indexing, Storage |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
B-tree global indexing provides efficient cross-partition key lookups by maintaining a sorted, disk-based tree structure that maps primary keys to their physical locations across all table partitions.
Description
The B-tree global index addresses a critical limitation in partitioned data lake tables: locating specific rows by primary key when the partitioning scheme doesn't align with the lookup key. Traditional partitioned tables require scanning all partitions when querying by non-partition keys, making point lookups prohibitively expensive. A global index overcomes this by maintaining a separate data structure that spans all partitions, enabling direct navigation to the relevant data files and row positions.
The B-tree structure organizes keys in a balanced tree where each internal node contains key ranges and pointers to child nodes, while leaf nodes contain the actual key-to-location mappings. This hierarchical organization provides logarithmic lookup complexity, making it scalable to billions of rows. The index is persisted to storage in a block-based format, where each tree node occupies one or more blocks. During queries, the system loads only the necessary blocks from the root to the target leaf, minimizing I/O overhead.
The implementation serializes keys and values into compact binary formats, using techniques like prefix compression to reduce index size. Footer metadata stores critical information like the root block offset, tree depth, and key serialization schema. Range queries leverage the sorted nature of B-trees by traversing leaf nodes sequentially, while point lookups descend directly to the target leaf. The index structure supports various key types through pluggable serialization strategies, allowing the system to index different data types efficiently.
Usage
Apply B-tree global indexing when implementing point lookup or small range query optimization for partitioned tables, supporting secondary index functionality, or enabling efficient joins on non-partition keys. This pattern is essential when query patterns don't align with the physical partitioning scheme.
Theoretical Basis
The B-tree global index follows classical B-tree principles adapted for immutable storage:
Tree Structure
- Internal nodes store M key-range separators and M+1 child pointers
- Leaf nodes store N key-value pairs (key -> file location + row offset)
- All leaf nodes are at the same depth, ensuring balanced tree
- Nodes are stored as fixed-size blocks in sequential storage
Index Construction
function buildBTreeIndex(sortedKeyValuePairs):
leafBlocks = createLeafBlocks(sortedKeyValuePairs, LEAF_BLOCK_SIZE)
if leafBlocks.size() == 1:
return createIndexFile(leafBlocks[0], rootOffset=0)
// Build internal levels bottom-up
currentLevel = leafBlocks
while currentLevel.size() > 1:
nextLevel = []
for each group of MAX_FANOUT blocks in currentLevel:
separatorKeys = extractFirstKeys(group)
internalNode = createInternalNode(separatorKeys, blockPointers)
nextLevel.add(internalNode)
currentLevel = nextLevel
rootBlock = currentLevel[0]
return writeIndexFile(rootBlock, leafBlocks)
Lookup Algorithm
function lookup(key, indexFile):
footer = readFooter(indexFile)
currentBlockOffset = footer.rootOffset
for level in 0 to footer.treeDepth:
block = readBlock(indexFile, currentBlockOffset)
if block.isLeaf():
return block.findValue(key)
else:
// Binary search in internal node
childIndex = block.findChildIndex(key)
currentBlockOffset = block.getChildPointer(childIndex)
return null // Key not found
Range Scan
function rangeScan(startKey, endKey, indexFile):
// Find first leaf containing startKey
leafBlock = navigateToLeaf(startKey, indexFile)
results = []
while leafBlock != null:
for each (key, value) in leafBlock:
if key >= startKey and key <= endKey:
results.add(value)
else if key > endKey:
return results
leafBlock = leafBlock.nextLeafPointer
return results
Block Format
- Header: block type (leaf/internal), entry count, next block pointer (for leaves)
- Body: serialized keys and values/pointers
- Keys use prefix compression to reduce redundancy
- Fixed-size entries enable binary search within blocks