Principle:Lance format Lance Version Cleanup
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Storage_Optimization |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Version cleanup is the process of removing old dataset versions and their exclusively-referenced files from storage to reclaim disk space and reduce metadata overhead in a Lance dataset.
Description
Lance uses an append-only versioning model: every write operation (insert, delete, update, compaction) creates a new version with its own manifest. Over time, these accumulate and consume storage. Version cleanup identifies old versions that are no longer needed, removes their manifest files, and deletes any data files, deletion files, index files, and transaction files that are no longer referenced by any remaining version.
The cleanup process is conservative by default to avoid data loss:
- Manifest inspection: All manifest files are loaded and inspected. Manifests that satisfy the cleanup policy (e.g., older than a specified timestamp or before a specific version number) are marked for removal, except the latest manifest, which is never removed.
- Reference tracking: Files referenced by retained manifests form the working set. Files referenced by old (removed) manifests but not by any retained manifest are candidates for deletion.
- Unverified file handling: Files not referenced by any manifest at all are ambiguous: they could be left over from an abandoned transaction or belong to an in-progress operation. By default, these are deleted only if they are at least 7 days old. Setting delete_unverified to true removes them regardless of age, so it should be used only when no concurrent writers are active.
- Tagged version protection: If tagged versions (named references) fall within the cleanup window, the default behavior is to raise an error. This check can be disabled by setting error_if_tagged_old_versions to false.
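The manifest selection and tag check described above can be sketched in plain Python. This is an illustrative model, not Lance's implementation: the Manifest class, TaggedVersionError, and select_manifests_to_remove are hypothetical names, and the sketch assumes that tagged versions are kept (not removed) when the error is disabled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Manifest:
    """Hypothetical stand-in for a manifest: a version number and commit time."""
    version: int
    timestamp: datetime


class TaggedVersionError(Exception):
    """Raised when a tagged version falls inside the cleanup window."""


def select_manifests_to_remove(manifests, before, tags,
                               error_if_tagged_old_versions=True):
    """Return the version numbers to remove, in ascending order.

    `tags` maps tag name -> version number. The latest version is always kept.
    """
    latest = max(m.version for m in manifests)
    old = [m for m in manifests if m.timestamp < before and m.version != latest]
    tagged_versions = set(tags.values())
    tagged_old = sorted(m.version for m in old if m.version in tagged_versions)
    if tagged_old and error_if_tagged_old_versions:
        raise TaggedVersionError(f"tagged versions in cleanup window: {tagged_old}")
    # Assumption: with the check disabled, tagged versions are kept, not removed.
    return sorted(m.version for m in old if m.version not in tagged_versions)
```

With five versions that all predate the cutoff and a tag on version 2, the default call raises TaggedVersionError, while passing error_if_tagged_old_versions=False returns versions 1, 3, and 4 (version 5 is the latest and is always kept).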
Cleanup is configured through the CleanupPolicy builder, which supports timestamp-based cleanup, version-count-based retention, and the unverified file handling options. An automatic cleanup hook (auto_cleanup_hook) can be configured through dataset configuration keys to run cleanup periodically after commits.
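As a rough sketch, a manual cleanup call from Python might look like the following. It assumes the pylance package, whose dataset object exposes cleanup_old_versions with the older_than, delete_unverified, and error_if_tagged_old_versions parameters. The helper names cleanup_options and run_cleanup and the 14-day default are hypothetical choices for this example.

```python
from datetime import timedelta


def cleanup_options(retention_days, delete_unverified=False):
    """Build keyword arguments for a cleanup call (hypothetical helper)."""
    return {
        # Versions older than this window become candidates for removal.
        "older_than": timedelta(days=retention_days),
        # Leave False unless no other process can be writing to the dataset.
        "delete_unverified": delete_unverified,
        # Fail loudly instead of silently touching tagged versions.
        "error_if_tagged_old_versions": True,
    }


def run_cleanup(uri, retention_days=14, delete_unverified=False):
    """Open the dataset at `uri` and clean up versions past the retention window."""
    import lance  # imported lazily; assumes the pylance package is installed

    dataset = lance.dataset(uri)
    return dataset.cleanup_old_versions(
        **cleanup_options(retention_days, delete_unverified)
    )
```

Keeping the option construction separate from the call makes the policy easy to inspect or log before anything is deleted.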
Usage
Use version cleanup:
- Periodically in a maintenance job to reclaim storage space.
- After a series of compaction operations that generated many intermediate versions.
- When storage costs are a concern and old versions are no longer needed for time-travel queries.
- Through the auto-cleanup hook for hands-off maintenance.
Theoretical Basis
Version cleanup is a form of garbage collection over an immutable, versioned file system:
retained_manifests = {m for m in all_manifests if not policy.should_clean(m) or m == latest}
old_manifests = all_manifests - retained_manifests
referenced_files = union(files_in(m) for m in retained_manifests)
verified_files = union(files_in(m) for m in all_manifests)
for file in storage:
    if file in referenced_files:
        keep // part of working set
    else if file in verified_files:
        delete // referenced only by old versions
    else if file.age > 7 days or delete_unverified:
        delete // unverified, likely abandoned
    else:
        keep // possibly in-progress operation
for manifest in old_manifests:
    delete manifest
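The pseudocode above can be turned into a small runnable model. This is a sketch under simplifying assumptions rather than Lance's actual code: manifests are represented as a dict from version number to the set of paths each version references, storage is a dict from path to file age, and plan_cleanup is a hypothetical name.

```python
from datetime import timedelta

UNVERIFIED_THRESHOLD = timedelta(days=7)


def plan_cleanup(manifests, storage_ages, should_clean, delete_unverified=False):
    """Decide which manifests and files to delete.

    manifests: dict of version number -> set of file paths that version references
    storage_ages: dict of file path -> age as a timedelta
    should_clean: predicate on a version number (the cleanup policy)
    """
    latest = max(manifests)
    retained = {v for v in manifests if not should_clean(v) or v == latest}
    old = set(manifests) - retained

    referenced = set().union(*(manifests[v] for v in retained))
    verified = set().union(*manifests.values())

    files_to_delete = set()
    for path, age in storage_ages.items():
        if path in referenced:
            continue  # part of the working set
        if path in verified:
            files_to_delete.add(path)  # referenced only by old versions
        elif delete_unverified or age > UNVERIFIED_THRESHOLD:
            files_to_delete.add(path)  # unverified and likely abandoned
        # otherwise keep: possibly part of an in-progress operation
    return old, files_to_delete
```

The function only plans the deletions and performs no I/O, which makes the decision logic easy to test in isolation. Applying a plan and then planning again against the resulting state yields empty sets.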
Key safety invariants:
- Latest version protection: The current version is never removed, ensuring the dataset always remains readable.
- Conservative unverified handling: The 7-day default threshold provides a safety margin against deleting files from long-running concurrent operations.
- Tag awareness: Tagged versions serve as named checkpoints; by default, cleanup raises an error instead of removing them when they fall within the cleanup window.
- Idempotency: Running cleanup again with the same policy and no intervening writes leaves the dataset unchanged; files already deleted are simply not found.
The file types subject to cleanup include: manifest files (_versions/), data files (data/), deletion files (_deletions/), index files (_indices/), and transaction files (_transactions/).
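For illustration, a cleanup tool could classify paths by their top-level directory. The directory names below come from the list above; the kind labels, the file_kind helper, and the file names in the usage example are hypothetical.

```python
from pathlib import PurePosixPath

# Directory names taken from the list above; the labels are illustrative.
FILE_KINDS = {
    "_versions": "manifest",
    "data": "data",
    "_deletions": "deletion",
    "_indices": "index",
    "_transactions": "transaction",
}


def file_kind(relative_path):
    """Return the kind of file for a path relative to the dataset root, or None."""
    parts = PurePosixPath(relative_path).parts
    return FILE_KINDS.get(parts[0]) if parts else None
```

For example, file_kind("_versions/3.manifest") returns "manifest", and a path outside the five known directories returns None, so unrecognized files can be left untouched.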