Principle:Apache Paimon BTree Global Index

Knowledge Sources	Apache_Paimon
Domains	Indexing, Storage
Last Updated	2026-02-08 00:00 GMT

Overview

B-tree global indexing provides efficient cross-partition key lookups by maintaining a sorted, disk-based tree structure that maps primary keys to their physical locations across all table partitions.

Description

The B-tree global index addresses a critical limitation in partitioned data lake tables: locating specific rows by primary key when the partitioning scheme doesn't align with the lookup key. Traditional partitioned tables require scanning all partitions when querying by non-partition keys, making point lookups prohibitively expensive. A global index overcomes this by maintaining a separate data structure that spans all partitions, enabling direct navigation to the relevant data files and row positions.

The B-tree structure organizes keys in a balanced tree where each internal node contains key ranges and pointers to child nodes, while leaf nodes contain the actual key-to-location mappings. This hierarchical organization provides logarithmic lookup complexity, making it scalable to billions of rows. The index is persisted to storage in a block-based format, where each tree node occupies one or more blocks. During queries, the system loads only the necessary blocks from the root to the target leaf, minimizing I/O overhead.

The implementation serializes keys and values into compact binary formats, using techniques like prefix compression to reduce index size. Footer metadata stores critical information like the root block offset, tree depth, and key serialization schema. Range queries leverage the sorted nature of B-trees by traversing leaf nodes sequentially, while point lookups descend directly to the target leaf. The index structure supports various key types through pluggable serialization strategies, allowing the system to index different data types efficiently.

Usage

Apply B-tree global indexing when implementing point lookup or small range query optimization for partitioned tables, supporting secondary index functionality, or enabling efficient joins on non-partition keys. This pattern is essential when query patterns don't align with the physical partitioning scheme.

Theoretical Basis

The B-tree global index follows classical B-tree principles adapted for immutable storage:

Tree Structure

Internal nodes store M key-range separators and M+1 child pointers
Leaf nodes store N key-value pairs (key -> file location + row offset)
All leaf nodes are at the same depth, ensuring balanced tree
Nodes are stored as fixed-size blocks in sequential storage

Index Construction

function buildBTreeIndex(sortedKeyValuePairs):
    leafBlocks = createLeafBlocks(sortedKeyValuePairs, LEAF_BLOCK_SIZE)

    if leafBlocks.size() == 1:
        return createIndexFile(leafBlocks[0], rootOffset=0)

    // Build internal levels bottom-up
    currentLevel = leafBlocks

    while currentLevel.size() > 1:
        nextLevel = []

        for each group of MAX_FANOUT blocks in currentLevel:
            separatorKeys = extractFirstKeys(group)
            internalNode = createInternalNode(separatorKeys, blockPointers)
            nextLevel.add(internalNode)

        currentLevel = nextLevel

    rootBlock = currentLevel[0]
    return writeIndexFile(rootBlock, leafBlocks)

Lookup Algorithm

function lookup(key, indexFile):
    footer = readFooter(indexFile)
    currentBlockOffset = footer.rootOffset

    for level in 0 to footer.treeDepth:
        block = readBlock(indexFile, currentBlockOffset)

        if block.isLeaf():
            return block.findValue(key)
        else:
            // Binary search in internal node
            childIndex = block.findChildIndex(key)
            currentBlockOffset = block.getChildPointer(childIndex)

    return null  // Key not found

Range Scan

function rangeScan(startKey, endKey, indexFile):
    // Find first leaf containing startKey
    leafBlock = navigateToLeaf(startKey, indexFile)

    results = []

    while leafBlock != null:
        for each (key, value) in leafBlock:
            if key >= startKey and key <= endKey:
                results.add(value)
            else if key > endKey:
                return results

        leafBlock = leafBlock.nextLeafPointer

    return results

Block Format

Header: block type (leaf/internal), entry count, next block pointer (for leaves)
Body: serialized keys and values/pointers
Keys use prefix compression to reduce redundancy
Fixed-size entries enable binary search within blocks

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment