Workflow:Lance format Lance Table Optimization

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Storage_Optimization, ML_Ops
Last Updated 2026-02-08 19:00 GMT

Overview

End-to-end process for optimizing Lance table storage through file compaction, index optimization, and old version cleanup to maintain query performance and minimize storage costs.

Description

This workflow addresses the storage fragmentation that accumulates as a Lance dataset undergoes repeated writes, updates, and deletes. Over time, datasets develop many small fragments (from frequent appends), high deletion ratios within fragments (from deletes and updates), and stale index data (from schema changes or data mutations). The optimization pipeline compacts small fragments into larger ones, materializes pending deletions, remaps indices to new fragment layouts, and removes unreferenced data files from old versions.

Usage

Execute this workflow periodically when a dataset has accumulated many small fragments from frequent append operations, when deletion ratios are high (causing wasted I/O during scans), when index performance has degraded due to unindexed data, or when storage costs need to be reduced by cleaning up old versions.

Execution Steps

Step 1: Compaction Planning

Analyze the current dataset state to identify fragments that need compaction. The planner evaluates two criteria: fragments with fewer rows than the target size (default 1,048,576 rows per fragment, i.e. 1024 × 1024, roughly 1 million) that have adjacent neighbors to merge with, and fragments whose deletion ratio exceeds the materialization threshold (default 10%). The output is a compaction plan containing a set of independent rewrite tasks.

Key considerations:

  • Compaction plans respect index compatibility constraints
  • Tasks are independent and can be distributed across machines
  • Planning reads only fragment metadata, not actual data
  • Configure target_rows_per_fragment based on typical query patterns
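
The two planning criteria can be illustrated with a small, self-contained model. This is a sketch under stated assumptions, not Lance's planner: `FragmentMeta` and `plan_compaction` are hypothetical names, index compatibility constraints are ignored, and a fragment counts as "small" based on its physical row count.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FragmentMeta:
    """Hypothetical per-fragment metadata; the planner needs nothing else."""
    fragment_id: int
    physical_rows: int      # rows stored on disk, including deleted ones
    deleted_rows: int = 0

    @property
    def deletion_ratio(self) -> float:
        return self.deleted_rows / self.physical_rows if self.physical_rows else 0.0


def plan_compaction(
    fragments: List[FragmentMeta],
    target_rows: int = 1024 * 1024,      # roughly one million rows
    deletion_threshold: float = 0.10,
) -> List[List[int]]:
    """Return independent tasks; each task lists adjacent fragment ids."""
    tasks: List[List[int]] = []
    run: List[FragmentMeta] = []         # current run of adjacent candidates

    def flush() -> None:
        if not run:
            return
        needs_materialize = any(f.deletion_ratio > deletion_threshold for f in run)
        # A lone small fragment has no neighbor to merge with, so it is only
        # rewritten when its deletions need materializing.
        if len(run) > 1 or needs_materialize:
            tasks.append([f.fragment_id for f in run])
        run.clear()

    for frag in fragments:
        is_small = frag.physical_rows < target_rows
        is_dirty = frag.deletion_ratio > deletion_threshold
        if is_small or is_dirty:
            run.append(frag)
        else:
            flush()                      # a healthy fragment ends the run
    flush()
    return tasks
```

Because each task covers a disjoint run of adjacent fragments, the tasks share no inputs, which is what allows them to be handed to separate workers.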

Step 2: Fragment Rewriting

Execute the compaction tasks by reading source fragments, materializing deletions (removing deleted rows physically), and writing consolidated output fragments. Each task merges a group of small fragments into fewer, larger fragments at the target row count. Binary copy optimization skips re-encoding when source and target formats match.

Key considerations:

  • Rewriting is the most I/O-intensive step
  • Binary copy mode dramatically speeds up compaction when no re-encoding is needed
  • Each task produces a RewriteResult containing old and new fragment metadata
  • Tasks can execute in parallel across multiple machines for distributed compaction
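
The core of a rewrite task (drop deleted rows, then repack the survivors into fragments of the target size) can be sketched in plain Python. Everything here is an illustrative assumption: fragments are modeled as in-memory lists, the `Fragment` and `RewriteResult` classes are toy stand-ins rather than Lance's types, and the `row_id_map` field is added only to show what a later index remap would need. Binary copy and real file I/O are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

RowAddr = Tuple[int, int]  # (fragment_id, row offset within the fragment)


@dataclass
class Fragment:
    fragment_id: int
    rows: List[str]
    deleted: Set[int] = field(default_factory=set)  # offsets marked deleted


@dataclass
class RewriteResult:
    old_fragment_ids: List[int]
    new_fragments: List[Fragment]
    row_id_map: Dict[RowAddr, RowAddr]  # old address -> new address


def rewrite_fragments(
    sources: List[Fragment], target_rows: int, next_fragment_id: int
) -> RewriteResult:
    # Materialize deletions: only rows not marked deleted are carried over.
    live = [
        ((frag.fragment_id, offset), row)
        for frag in sources
        for offset, row in enumerate(frag.rows)
        if offset not in frag.deleted
    ]

    new_fragments: List[Fragment] = []
    row_id_map: Dict[RowAddr, RowAddr] = {}
    # Repack surviving rows into output fragments of at most target_rows rows.
    for start in range(0, len(live), target_rows):
        chunk = live[start:start + target_rows]
        new_id = next_fragment_id + len(new_fragments)
        new_fragments.append(Fragment(new_id, [row for _, row in chunk]))
        for new_offset, (old_addr, _) in enumerate(chunk):
            row_id_map[old_addr] = (new_id, new_offset)

    return RewriteResult([f.fragment_id for f in sources], new_fragments, row_id_map)
```

Deleted rows never appear in `row_id_map`, so anything that still points at them after the rewrite can be recognized and dropped.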

Step 3: Index Optimization

After compaction changes the fragment layout, indices must be updated to reference the new fragments. This step remaps existing index entries from old fragment IDs to new fragment IDs, and merges any unindexed data from recent writes into the index. For vector indices, this may involve re-training partition centroids on the updated data distribution.

Key considerations:

  • Index remapping avoids full index rebuilds when possible
  • Unindexed rows must be searched outside the index, which degrades search performance; this step incorporates them into the index
  • Vector index optimization can update centroids and quantization codebooks
  • Scalar indices (BTree, bitmap) are remapped without retraining
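
Remapping and merging can be pictured with a toy inverted index that maps a key to row addresses. This is a simplified model, not Lance's index format: `remap_index` and `merge_unindexed` are hypothetical helpers, addresses are `(fragment_id, offset)` tuples, and vector-index retraining is out of scope.

```python
from typing import Dict, Iterable, List, Set, Tuple

RowAddr = Tuple[int, int]            # (fragment_id, row offset)
Index = Dict[str, List[RowAddr]]     # key -> addresses of matching rows


def remap_index(
    index: Index,
    row_id_map: Dict[RowAddr, RowAddr],
    rewritten_fragment_ids: Set[int],
) -> Index:
    """Point index entries at the compacted fragments without rebuilding."""
    remapped: Index = {}
    for key, addrs in index.items():
        kept: List[RowAddr] = []
        for addr in addrs:
            if addr in row_id_map:
                kept.append(row_id_map[addr])      # row moved to a new fragment
            elif addr[0] not in rewritten_fragment_ids:
                kept.append(addr)                  # fragment untouched by compaction
            # Otherwise the row was deleted during the rewrite: drop the entry.
        if kept:
            remapped[key] = kept
    return remapped


def merge_unindexed(index: Index, new_rows: Iterable[Tuple[str, RowAddr]]) -> Index:
    """Fold rows written after the last index build into a copy of the index."""
    merged: Index = {key: list(addrs) for key, addrs in index.items()}
    for key, addr in new_rows:
        merged.setdefault(key, []).append(addr)
    return merged
```

Both functions touch only the affected entries, which is the reason remapping is cheaper than rebuilding the index from the data.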

Step 4: Compaction Commit

Commit all rewrite results as a single atomic transaction. This creates a new dataset version with the compacted fragment layout and updated indices. The commit verifies that no conflicting writes occurred during compaction and handles retries if conflicts are detected.

Key considerations:

  • The commit is atomic; either all changes apply or none do
  • Concurrent writers may cause conflicts; retry logic handles them
  • The new version references compacted fragments and updated indices
  • Old fragments become unreferenced but are not yet deleted
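
The commit-and-retry behavior described above follows the general pattern of optimistic concurrency control. The sketch below shows that pattern on an in-memory manifest; it is an assumption-laden model, not Lance's commit protocol. `ManifestStore`, `commit_compaction`, and `ConflictError` are hypothetical names, and the `before_swap` hook exists only so a concurrent writer can be simulated.

```python
from typing import Callable, List, Optional, Tuple


class ConflictError(Exception):
    """Raised when the compaction can no longer be applied."""


class ManifestStore:
    """Toy stand-in for the dataset's versioned manifest (hypothetical)."""

    def __init__(self, fragment_ids: List[int]) -> None:
        self.version = 1
        self.fragment_ids = list(fragment_ids)

    def read(self) -> Tuple[int, List[int]]:
        return self.version, list(self.fragment_ids)

    def compare_and_swap(self, expected_version: int, new_ids: List[int]) -> bool:
        # The write succeeds only if nobody committed since we read.
        if self.version != expected_version:
            return False
        self.version += 1
        self.fragment_ids = list(new_ids)
        return True


def commit_compaction(
    store: ManifestStore,
    old_ids: List[int],
    new_ids: List[int],
    max_retries: int = 3,
    before_swap: Optional[Callable[[int], None]] = None,  # simulation hook only
) -> int:
    old_set = set(old_ids)
    for attempt in range(max_retries):
        version, current = store.read()
        # Unrecoverable conflict: a source fragment no longer exists.
        if not old_set <= set(current):
            raise ConflictError("a source fragment was changed by another writer")
        if before_swap is not None:
            before_swap(attempt)
        # Swap the old fragments for the new ones, keeping everything else.
        updated: List[int] = []
        inserted = False
        for fid in current:
            if fid not in old_set:
                updated.append(fid)
            elif not inserted:
                updated.extend(new_ids)
                inserted = True
        if store.compare_and_swap(version, updated):
            return store.version      # all changes applied atomically
    raise ConflictError(f"gave up after {max_retries} attempts")
```

A concurrent append is a recoverable conflict in this model: the retry re-reads the manifest and reapplies the swap on top of the newer fragment list. A writer that removed one of the source fragments is not recoverable, so the commit fails and nothing is applied.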

Step 5: Old Version Cleanup

Remove data files that are no longer referenced by any active dataset version. This reclaims storage from old fragments, deleted index files, and expired deletion vectors. The cleanup process respects a configurable retention period to allow concurrent readers to finish before files are removed.

Key considerations:

  • Configure retention period based on longest expected read operation
  • Cleanup is safe to run concurrently with reads and writes
  • A file is removed only when no version inside the retention window still references it
  • Monitor storage reclamation to verify cleanup effectiveness
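
The retention rule can be expressed as a set difference over the files each version references. The following is a minimal sketch, assuming that the latest version is always retained and that only files referenced by an expired version are deletion candidates (so files from in-progress writes are never touched). `Version` and `files_to_delete` are hypothetical names, and `versions` is assumed to be non-empty.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Version:
    number: int
    created_at: datetime
    files: FrozenSet[str]   # data, index, and deletion files this version references


def files_to_delete(
    versions: List[Version], now: datetime, retention: timedelta
) -> Set[str]:
    cutoff = now - retention
    latest = max(versions, key=lambda v: v.number)

    retained: List[Version] = []
    expired: List[Version] = []
    for v in versions:
        # The latest version is always kept, however old it is.
        if v is latest or v.created_at >= cutoff:
            retained.append(v)
        else:
            expired.append(v)

    still_referenced: Set[str] = set()
    for v in retained:
        still_referenced |= v.files

    candidates: Set[str] = set()
    for v in expired:
        candidates |= v.files

    # Delete only what expired versions used and no retained version needs.
    return candidates - still_referenced
```

Comparing the total size of the returned files against actual storage usage before and after a run is one way to monitor how much each cleanup reclaims.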
