Principle:Lance format Lance Index Optimization

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Storage_Optimization
Last Updated 2026-02-08 19:00 GMT

Overview

Index optimization is the process of merging multiple incremental (delta) index segments into fewer, larger segments to improve query performance and reduce per-query overhead in a Lance dataset.

Description

Lance supports both vector indices (e.g., IVF) and scalar indices (e.g., BTree, inverted index). When new data is appended to an indexed table, Lance creates delta index segments covering only the newly added fragments rather than rebuilding the entire index. Over time, the number of delta segments grows, and each query must consult all of them, which degrades performance.

Index optimization addresses this by merging delta segments together. The process:

  1. Loads all current index metadata from the manifest.
  2. Groups indices by name (each index name may have multiple delta segments).
  3. For each group, calls merge_indices, which opens each delta segment, decides from OptimizeOptions whether a merge is needed, and produces a new unified index whose fragment bitmap is the union of the merged segments' bitmaps.
  4. Commits the new merged index metadata as an Operation::CreateIndex transaction, listing both the new indices and the old indices that were replaced.
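
Steps 2 and 3 can be illustrated with a small, self-contained Python model. This is a sketch under stated assumptions, not Lance's implementation: IndexSegment, group_by_name, and merge_segments are hypothetical names, and a frozenset of fragment IDs stands in for the fragment bitmap.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexSegment:
    """Hypothetical stand-in for one index segment's metadata."""
    name: str
    fragments: frozenset  # fragment IDs covered (models the fragment bitmap)


def group_by_name(segments):
    """Group segments by index name, preserving their original order."""
    groups = defaultdict(list)
    for seg in segments:
        groups[seg.name].append(seg)
    return dict(groups)


def merge_segments(deltas):
    """Merge a non-empty list of deltas into one segment covering all fragments."""
    covered = frozenset().union(*(d.fragments for d in deltas))
    return IndexSegment(deltas[0].name, covered)


segments = [
    IndexSegment("vec_idx", frozenset({0, 1})),
    IndexSegment("vec_idx", frozenset({2})),
    IndexSegment("id_idx", frozenset({0, 1, 2})),
    IndexSegment("vec_idx", frozenset({3})),
]

# One merged segment per index name.
merged = {name: merge_segments(deltas)
          for name, deltas in group_by_name(segments).items()}
```

In this model the three vec_idx deltas collapse into a single segment covering fragments 0 through 3, while id_idx, which had only one segment, is unchanged.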

The merge strategy is controlled by OptimizeOptions::num_indices_to_merge. When it is None, the system decides automatically: it creates a new delta if no partition split is needed, or merges all deltas if partitions were split. When it is Some(N), the latest N index segments are merged. When retrain is true, num_indices_to_merge is ignored and the entire index is retrained from scratch.
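
The selection rules above can be modeled as a small Python function. It is a hypothetical sketch rather than Lance's code: select_segments_to_merge is an invented name, partitions_split is an assumed input standing in for Lance's internal decision, and treating N = 0 as "merge nothing" is an assumption of this model.

```python
from typing import List, Optional


def select_segments_to_merge(deltas: List[str],
                             num_indices_to_merge: Optional[int],
                             retrain: bool = False,
                             partitions_split: bool = False) -> List[str]:
    """Return the existing segments that the new index would replace."""
    if retrain:
        # Retraining rebuilds from scratch, so every segment is replaced
        # and num_indices_to_merge is ignored.
        return list(deltas)
    if num_indices_to_merge is None:
        # Automatic mode: merge everything only if partitions were split;
        # otherwise a new delta is added and nothing is replaced.
        return list(deltas) if partitions_split else []
    if num_indices_to_merge == 0:
        # Guard: deltas[-0:] would select the whole list in Python.
        return []
    # Some(N): the latest N segments (all of them if N exceeds the count).
    return list(deltas[-num_indices_to_merge:])


deltas = ["d0", "d1", "d2", "d3"]  # oldest first
latest_two = select_segments_to_merge(deltas, 2)
auto_no_split = select_segments_to_merge(deltas, None)
retrained = select_segments_to_merge(deltas, 2, retrain=True)
```

Here latest_two holds the two newest segments, auto_no_split is empty because only a new delta is created, and retrained holds all four segments regardless of the requested N.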

Separately, the IndexRemapper trait handles row ID remapping after compaction. When fragments are rewritten and row addresses change, every affected index must be updated to point to the new addresses. This can happen during compaction (immediate remap) or be deferred via the fragment reuse index.

Usage

Use index optimization:

  • After several rounds of data appends to consolidate delta index segments.
  • After compaction to ensure indices reference the correct row addresses.
  • Periodically in a maintenance loop to keep query latency stable.

Theoretical Basis

Index optimization is a form of log-structured merge (LSM) compaction applied to index segments:

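indices_by_name = group_by_name(all_indices)
for name, deltas in indices_by_name:
    if retrain:
        # Retrain from scratch; every existing segment is replaced.
        new_index = train_from_scratch(all_data)
        merged_deltas = deltas
    else if num_indices_to_merge is Some(N):
        # Merge the latest N segments plus any unindexed fragments.
        merged_deltas = deltas[-N:]
        new_index = merge(merged_deltas, unindexed_fragments)
    else:
        # None: the system chooses which segments (if any) to merge.
        new_index, merged_deltas = auto_merge(deltas, unindexed_fragments)

    commit CreateIndex(new_index, removed=merged_deltas)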

Read amplification is the primary concern: with K delta segments, each query must search all K of them. Merging reduces K to 1 (or to a smaller number when only the latest N segments are merged), so per-query overhead no longer grows with the number of appends. The trade-off is write amplification from rewriting index data. The num_indices_to_merge parameter lets users control this trade-off.
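
The trade-off can be made concrete with a toy simulation. The model below is an illustrative assumption, not a measurement of Lance: it starts with one base segment, adds one delta per append, and optionally merges everything into a single segment after every merge_every appends. The function name simulate is hypothetical.

```python
def simulate(num_appends: int, merge_every: int = 0):
    """Return (segments searched per query after each append, segments rewritten).

    merge_every == 0 means the index is never optimized.
    """
    segments = 1          # the initial, fully trained index
    reads = []            # read amplification after each append
    rewritten = 0         # crude proxy for write amplification
    for i in range(1, num_appends + 1):
        segments += 1     # each append adds one delta segment
        if merge_every and i % merge_every == 0:
            rewritten += segments  # a full merge rewrites every segment
            segments = 1
        reads.append(segments)
    return reads, rewritten


never = simulate(6)                   # ([2, 3, 4, 5, 6, 7], 0)
every_third = simulate(6, merge_every=3)   # ([2, 3, 1, 2, 3, 1], 8)
every_append = simulate(6, merge_every=1)  # ([1, 1, 1, 1, 1, 1], 12)
```

Without optimization the number of segments searched grows with every append. Merging after each append keeps it at 1 but rewrites the most data, and merging every third append sits in between.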

For index remapping after compaction, the IndexRemapper trait receives a HashMap<u64, Option<u64>> mapping old row IDs to new row IDs (or None for deleted rows) and the list of affected fragment IDs. Each index implementation uses this map to update its internal row references without rebuilding the entire index.
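
The remapping step can be sketched in Python with a dict playing the role of HashMap<u64, Option<u64>>. This is a minimal model, not the IndexRemapper implementation: remap_row_ids and row_addr are hypothetical helpers, and the example assumes row addresses pack the fragment ID into the upper 32 bits and the row offset into the lower 32 bits.

```python
from typing import Dict, List, Optional


def row_addr(fragment_id: int, offset: int) -> int:
    """Assumed address layout: fragment ID in the high 32 bits, offset in the low."""
    return (fragment_id << 32) | offset


def remap_row_ids(indexed_row_ids: List[int],
                  mapping: Dict[int, Optional[int]]) -> List[int]:
    """Apply an old -> new row ID mapping to the row IDs stored in an index.

    - IDs absent from the mapping are unaffected and kept as-is.
    - IDs mapped to None belong to deleted rows and are dropped.
    - IDs mapped to an int are rewritten to the new value.
    """
    remapped = []
    for row_id in indexed_row_ids:
        if row_id not in mapping:
            remapped.append(row_id)
            continue
        new_id = mapping[row_id]
        if new_id is not None:
            remapped.append(new_id)
    return remapped


# Fragments 0 and 1 are compacted into fragment 2; row (0, 1) was deleted.
mapping = {
    row_addr(0, 0): row_addr(2, 0),
    row_addr(0, 1): None,
    row_addr(1, 0): row_addr(2, 1),
}
indexed = [row_addr(0, 0), row_addr(0, 1), row_addr(1, 0), row_addr(5, 7)]
updated = remap_row_ids(indexed, mapping)
```

After remapping, the index holds the two surviving rows at their new addresses in fragment 2, and the entry in fragment 5, which was not part of the compaction, is untouched.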

Related Pages

Implemented By
