Principle:Lance format Lance Row Deletion
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Columnar_Storage |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Row deletion removes rows from a Lance dataset that match a SQL predicate, using a soft-delete mechanism with deletion vectors that avoids immediate data rewriting.
Description
Lance implements row deletion using a deletion vector strategy rather than immediately rewriting data files. When rows are deleted:
- The predicate is evaluated against the dataset to identify matching rows.
- For each affected fragment, a deletion vector (a bitmap of deleted row offsets) is written alongside the fragment metadata.
- A new dataset version is committed that includes the updated deletion vectors.
The actual data bytes remain in the underlying storage files. Subsequent reads automatically filter out deleted rows by consulting the deletion vectors. The data is only physically removed when a compaction operation rewrites the fragments, followed by a cleanup step that removes old files.
This design has several advantages:
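- Speed: The write side of a delete produces only small bitmaps and updated metadata, so no data files are rewritten. The predicate still has to be evaluated to locate matching rows, but the cost of writing scales with the number of affected fragments, not with the number of deleted rows.
- Concurrency safety: A delete commits a new version instead of modifying existing data files, so it can run concurrently with reads. Readers on older versions are unaffected.
- Reversibility: Since data is not physically removed, previous versions remain fully accessible for time-travel queries until compaction and cleanup remove the old files.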
The deletion API also supports truncation (deleting all rows) via the truncate_table convenience method, which internally calls delete("true").
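The mechanism described in this section can be illustrated with a small in-memory model. This is a sketch only, not Lance's implementation: ToyDataset and Fragment are hypothetical names, a Python frozenset stands in for the deletion vector, and a Python callable stands in for the SQL predicate.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment: immutable rows plus a deletion vector."""
    rows: tuple                       # stand-in for the on-disk data file
    deleted: frozenset = frozenset()  # stand-in for the deletion vector

    def live_rows(self):
        # Reads skip every offset recorded in the deletion vector.
        return [r for i, r in enumerate(self.rows) if i not in self.deleted]


@dataclass
class ToyDataset:
    """Hypothetical dataset: each version is a tuple of fragments."""
    versions: list = field(default_factory=list)

    def write(self, *fragments):
        self.versions.append(tuple(Fragment(tuple(f)) for f in fragments))

    def delete(self, predicate):
        new_fragments = []
        for frag in self.versions[-1]:
            hits = {i for i, r in enumerate(frag.rows)
                    if i not in frag.deleted and predicate(r)}
            # Only fragments with matches get a new deletion vector;
            # the row data is shared with the old version, not rewritten.
            new_fragments.append(
                Fragment(frag.rows, frag.deleted | hits) if hits else frag
            )
        self.versions.append(tuple(new_fragments))  # commit a new version

    def scan(self, version=-1):
        return [r for frag in self.versions[version] for r in frag.live_rows()]


ds = ToyDataset()
ds.write([1, 2, 3], [4, 5, 6])       # version index 0
ds.delete(lambda r: r % 2 == 0)      # version index 1: even values deleted
after_delete = ds.scan()
ds.delete(lambda r: True)            # truncation: a delete that matches every row
after_truncate = ds.scan()
original = ds.scan(version=0)        # the first version still sees every row
```

In this model a delete never copies or mutates the row tuples, which is why earlier versions stay readable and why truncation is just a delete whose predicate is always true.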
Usage
Use row deletion when:
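- Removing rows that match a filter condition (e.g., GDPR right-to-erasure, data quality cleanup). Because the delete is logical, physical removal of the data still depends on the compaction and cleanup steps described under Physical Cleanup.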
- Truncating a dataset to start fresh while preserving the schema and version history.
- Implementing soft-delete patterns where data can be recovered from prior versions.
- Performing delete operations in a pipeline that will later compact the dataset to reclaim storage.
Theoretical Basis
Deletion Vector Model
Lance's deletion vectors are stored as bitmaps (one per fragment) that track which row offsets within a fragment have been logically deleted. The implementation uses RoaringBitmap for compact, efficient storage of sparse and dense deletion patterns.
When a delete operation is executed:
- Predicate evaluation: The filter predicate is compiled into a physical expression and evaluated against each fragment. For fragments with scalar indices, the index may accelerate the evaluation.
- Deletion vector update: For each fragment containing matching rows, the row offsets are added to the fragment's deletion vector. If the fragment already has a deletion vector (from a prior delete in the same version lineage), the new deletions are merged.
- Transaction commit: A transaction is created with the updated fragment metadata (including the new deletion vectors) and committed. Lance's retry mechanism handles concurrent write conflicts (default: 10 retries, 30-second timeout).
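The three steps above can be sketched with a toy manifest store that rejects commits based on a stale version. Everything here is an assumption made for illustration: ToyManifestStore, CommitConflict, and delete_with_retry are hypothetical names, plain frozensets replace RoaringBitmap, and the retry loop simply re-evaluates the predicate, which may differ from how Lance actually resolves conflicts. Only the default values (10 retries, 30 seconds) are taken from the text.

```python
import time


class CommitConflict(Exception):
    """Raised when a commit is based on a version that is no longer current."""


class ToyManifestStore:
    """Hypothetical store: a version counter plus per-fragment deletion vectors."""

    def __init__(self):
        self.version = 1
        self.deletion_vectors = {}  # fragment id -> frozenset of row offsets

    def commit(self, read_version, updates):
        if read_version != self.version:
            raise CommitConflict(f"read v{read_version}, current v{self.version}")
        merged = dict(self.deletion_vectors)
        for frag_id, offsets in updates.items():
            # Merge new deletions into any existing vector for the fragment.
            merged[frag_id] = merged.get(frag_id, frozenset()) | offsets
        self.deletion_vectors = merged
        self.version += 1
        return self.version


def delete_with_retry(store, find_matches, max_retries=10, timeout_s=30.0):
    deadline = time.monotonic() + timeout_s
    for _ in range(max_retries):
        if time.monotonic() > deadline:
            break
        read_version = store.version
        updates = find_matches(store.deletion_vectors)  # predicate evaluation
        if not updates:
            return read_version                         # nothing to delete
        try:
            return store.commit(read_version, updates)  # transaction commit
        except CommitConflict:
            continue                                    # re-read and try again
    raise CommitConflict("gave up after exhausting retries or the timeout")


fragments = {0: [10, 20, 30], 1: [40, 50]}   # fragment id -> column values
store = ToyManifestStore()
store.commit(1, {0: frozenset({0})})         # an earlier delete in fragment 0
calls = {"n": 0}


def rows_over_25(existing):
    calls["n"] += 1
    if calls["n"] == 1:
        # Simulate another writer committing between our read and our commit.
        store.commit(store.version, {1: frozenset({1})})
    updates = {}
    for frag_id, values in fragments.items():
        already = existing.get(frag_id, frozenset())
        hits = frozenset(i for i, v in enumerate(values)
                         if v > 25 and i not in already)
        if hits:
            updates[frag_id] = hits
    return updates


final_version = delete_with_retry(store, rows_over_25)
```

The first attempt fails because the simulated concurrent writer advanced the version; the second attempt re-reads the current deletion vectors, finds only the offsets that are still live, and commits them as a merge.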
Physical Cleanup
Deletion vectors do not reduce storage consumption. To physically reclaim space:
- Run optimize::compact_files() to rewrite fragments without the deleted rows.
- Run cleanup::cleanup_old_versions() to remove old data files and deletion vectors.
This two-step approach allows users to batch deletes and compact periodically, amortizing the cost of data rewriting.
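A toy accounting of stored rows shows why both steps are needed. The sketch below is not the Lance API: the file names, the files and versions structures, and the two helper functions are hypothetical stand-ins for the effect of optimize::compact_files() and cleanup::cleanup_old_versions() as described above.

```python
# Hypothetical storage: file name -> rows physically stored in that file.
files = {"frag-0.lance": ["a", "b", "c", "d"]}

# Each version lists (file name, deletion vector) pairs.
versions = [
    [("frag-0.lance", frozenset())],         # initial write
    [("frag-0.lance", frozenset({1, 3}))],   # after a delete of offsets 1 and 3
]


def stored_rows():
    """Total rows physically held in storage across all files."""
    return sum(len(rows) for rows in files.values())


def compact_latest():
    """Rewrite fragments that carry deletions; commit the result as a new version."""
    manifest = []
    for name, deleted in versions[-1]:
        if not deleted:
            manifest.append((name, deleted))
            continue
        new_name = name.replace(".lance", "-compacted.lance")
        files[new_name] = [r for i, r in enumerate(files[name])
                           if i not in deleted]
        manifest.append((new_name, frozenset()))
    versions.append(manifest)


def cleanup_old():
    """Drop all but the latest version, then remove files nothing references."""
    del versions[:-1]
    referenced = {name for name, _ in versions[-1]}
    for name in list(files):
        if name not in referenced:
            del files[name]


before = stored_rows()          # the delete alone reclaimed nothing
compact_latest()
after_compact = stored_rows()   # old and new files coexist
cleanup_old()
after_cleanup = stored_rows()   # only the compacted file remains
```

In this model storage temporarily grows after compaction, because the old file must remain available to earlier versions, and only shrinks once cleanup discards those versions.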