Principle:Lance format Lance Index Optimization
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Storage_Optimization |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Index optimization is the process of merging multiple incremental (delta) index segments into fewer, larger segments to improve query performance and reduce per-query overhead in a Lance dataset.
Description
Lance supports both vector indices (e.g., IVF) and scalar indices (e.g., BTree, inverted index). When new data is appended to an indexed table, Lance creates delta index segments covering only the newly added fragments rather than rebuilding the entire index. Over time, the number of delta segments grows, and each query must consult all of them, which degrades performance.
Index optimization addresses this by merging delta segments together. The process:
- Loads all current index metadata from the manifest.
- Groups indices by name (each index name may have multiple delta segments).
- For each group, calls merge_indices, which opens each delta segment, determines whether a merge is needed based on OptimizeOptions, and produces a new unified index covering the combined fragment bitmap.
- Commits the new merged index metadata as an Operation::CreateIndex transaction, listing both the new indices and the old indices that were replaced.
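The grouping and merging steps above can be sketched in a few lines of Rust. This is a simplified model, not Lance's implementation: IndexSegment, segment, group_by_name, merge_group, and optimize are hypothetical names, a BTreeSet of fragment IDs stands in for the fragment bitmap, and the sketch merges every group with more than one segment regardless of OptimizeOptions.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified stand-in for one index segment's metadata.
#[derive(Debug, Clone, PartialEq)]
struct IndexSegment {
    name: String,
    /// Fragment IDs covered by this segment (stand-in for the fragment bitmap).
    fragments: BTreeSet<u32>,
}

/// Convenience constructor used only for illustration.
fn segment(name: &str, fragments: &[u32]) -> IndexSegment {
    IndexSegment {
        name: name.to_string(),
        fragments: fragments.iter().copied().collect(),
    }
}

/// Group segments by index name, keeping their original order within each group.
fn group_by_name(segments: &[IndexSegment]) -> BTreeMap<String, Vec<IndexSegment>> {
    let mut groups: BTreeMap<String, Vec<IndexSegment>> = BTreeMap::new();
    for seg in segments {
        groups.entry(seg.name.clone()).or_default().push(seg.clone());
    }
    groups
}

/// Merge one group into a single segment covering the union of its fragments.
fn merge_group(name: &str, deltas: &[IndexSegment]) -> IndexSegment {
    let mut fragments = BTreeSet::new();
    for delta in deltas {
        fragments.extend(delta.fragments.iter().copied());
    }
    IndexSegment { name: name.to_string(), fragments }
}

/// Returns (new segments, removed segments), the two lists a commit would record.
/// Groups that already consist of a single segment are left untouched.
fn optimize(segments: &[IndexSegment]) -> (Vec<IndexSegment>, Vec<IndexSegment>) {
    let mut new_segments = Vec::new();
    let mut removed = Vec::new();
    for (name, deltas) in group_by_name(segments) {
        if deltas.len() > 1 {
            new_segments.push(merge_group(&name, &deltas));
            removed.extend(deltas);
        }
    }
    (new_segments, removed)
}
```

Returning both lists mirrors the commit step: the transaction needs to know which segments are new and which ones they replace.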
The merge strategy is controlled by OptimizeOptions::num_indices_to_merge. When set to None, the system decides automatically: it creates a new delta if no partition split is needed, or merges all deltas if partitions were split. When set to Some(N), the latest N index segments are merged. Setting retrain to true ignores this parameter and retrains the entire index from scratch.
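The decision rules in this paragraph can be written as a small function. The sketch below is an assumption-laden model: OptimizeOptionsSketch, Plan, and plan are hypothetical names, not the real Lance types, and whether partitions were split is passed in as a plain boolean because the source does not say how that is detected. Clamping N to the number of existing segments is also an assumption.

```rust
/// Hypothetical mirror of the two options described above; not the real Lance struct.
#[derive(Debug, Clone, Default)]
struct OptimizeOptionsSketch {
    num_indices_to_merge: Option<usize>,
    retrain: bool,
}

/// What the optimizer would do for one index name.
#[derive(Debug, PartialEq)]
enum Plan {
    /// Discard the existing segments and train a new index on all data.
    Retrain,
    /// Index only the not-yet-indexed fragments as a new delta segment.
    NewDelta,
    /// Merge the latest `count` segments into one.
    MergeLatest { count: usize },
}

/// Choose a plan for an index that currently has `num_deltas` segments.
/// `partitions_split` is an assumed input for the automatic (None) case.
fn plan(opts: &OptimizeOptionsSketch, num_deltas: usize, partitions_split: bool) -> Plan {
    if opts.retrain {
        // retrain takes precedence over num_indices_to_merge.
        return Plan::Retrain;
    }
    match opts.num_indices_to_merge {
        // Explicit N: merge the latest N segments (clamped to what exists).
        Some(n) => Plan::MergeLatest { count: n.min(num_deltas) },
        // Automatic: merge everything only if partitions were split.
        None if partitions_split => Plan::MergeLatest { count: num_deltas },
        None => Plan::NewDelta,
    }
}
```

For example, with four existing segments, Some(2) yields MergeLatest { count: 2 }, while the default options yield NewDelta unless a partition split occurred.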
Separately, the IndexRemapper trait handles row ID remapping after compaction. When fragments are rewritten and row addresses change, every affected index must be updated to point to the new addresses. This can happen during compaction (immediate remap) or be deferred via the fragment reuse index.
Usage
Use index optimization:
- After several rounds of data appends to consolidate delta index segments.
- After compaction to ensure indices reference the correct row addresses.
- Periodically in a maintenance loop to keep query latency stable.
Theoretical Basis
Index optimization is a form of log-structured merge (LSM) compaction applied to index segments:
    indices_by_name = group_by_name(all_indices)
    for name, deltas in indices_by_name:
        if retrain:
            new_index = train_from_scratch(all_data)
        else if num_indices_to_merge is Some(N):
            new_index = merge(deltas[-N:], unindexed_fragments)
        else:
            new_index = auto_merge(deltas, unindexed_fragments)
        commit CreateIndex(new_index, removed=merged_deltas)
Read amplification is the primary concern: with K delta segments, each query must search all K. Merging reduces K, down to 1 when every segment is merged, so the number of segments consulted per query stays small and bounded instead of growing with each append. The trade-off is write amplification from rewriting index data. The num_indices_to_merge parameter lets users control this trade-off.
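A rough way to see the trade-off is to count, for a given choice of N, how many segments remain to be searched and how many indexed rows the merge has to rewrite. The function below is a back-of-the-envelope model with a hypothetical name (merge_cost); it ignores unindexed fragments and treats rows rewritten as a proxy for write cost.

```rust
/// Given the row counts of the existing segments (oldest first), model merging
/// the latest `n` of them. Returns (segments searched per query afterwards,
/// rows rewritten by the merge).
fn merge_cost(segment_rows: &[u64], n: usize) -> (usize, u64) {
    let n = n.min(segment_rows.len());
    if n == 0 {
        // Nothing merged: read amplification unchanged, nothing rewritten.
        return (segment_rows.len(), 0);
    }
    let kept = segment_rows.len() - n;
    // The merged segments are rewritten in full into one new segment.
    let rewritten: u64 = segment_rows[kept..].iter().sum();
    (kept + 1, rewritten)
}
```

With one large base segment of 1,000,000 rows and three deltas of 10,000 rows each, merging the latest three leaves 2 segments and rewrites 30,000 rows, whereas merging all four leaves 1 segment but rewrites 1,030,000 rows.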
For index remapping after compaction, the IndexRemapper trait receives a HashMap<u64, Option<u64>> mapping old row IDs to new row IDs (or None for deleted rows) and the list of affected fragment IDs. Each index implementation uses this map to update its internal row references without rebuilding the entire index.
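As an illustration of how such a map might be applied, the sketch below remaps a flat list of row IDs. It is not the IndexRemapper trait itself: remap_row_ids, mapping_from, and addr are hypothetical helpers, the packing of a fragment ID and row offset into one u64 is assumed for illustration, and treating IDs absent from the map as unchanged is also an assumption.

```rust
use std::collections::HashMap;

/// Illustrative row address: fragment ID in the high 32 bits, row offset in
/// the low 32 bits (an assumed encoding for this sketch).
fn addr(fragment: u32, offset: u32) -> u64 {
    ((fragment as u64) << 32) | offset as u64
}

/// Build an old-to-new mapping from (old, new) pairs; `None` marks a deleted row.
fn mapping_from(pairs: &[(u64, Option<u64>)]) -> HashMap<u64, Option<u64>> {
    pairs.iter().copied().collect()
}

/// Apply the mapping to the row IDs stored in an index:
/// - mapped to Some(new): the row moved, so store the new ID
/// - mapped to None: the row was deleted, so drop it
/// - not in the map: assumed untouched by compaction, so keep it as-is
fn remap_row_ids(row_ids: &[u64], mapping: &HashMap<u64, Option<u64>>) -> Vec<u64> {
    row_ids
        .iter()
        .filter_map(|id| match mapping.get(id) {
            Some(Some(new_id)) => Some(*new_id),
            Some(None) => None,
            None => Some(*id),
        })
        .collect()
}
```

A real index would apply the same three-way rule to whatever internal structure holds its row references (posting lists, leaf pages, partition entries) rather than to a flat vector.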