Principle:Lance format Lance Version Cleanup

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, Storage_Optimization
Last Updated: 2026-02-08 19:00 GMT

Overview

Version cleanup is the process of removing old dataset versions and their exclusively-referenced files from storage to reclaim disk space and reduce metadata overhead in a Lance dataset.

Description

Lance uses an append-only versioning model: every write operation (insert, delete, update, compaction) creates a new version with its own manifest. Over time, these accumulate and consume storage. Version cleanup identifies old versions that are no longer needed, removes their manifest files, and deletes any data files, deletion files, index files, and transaction files that are no longer referenced by any remaining version.

The cleanup process is conservative by default to avoid data loss:

  1. Manifest inspection: All manifest files are loaded and inspected. Manifests that satisfy the cleanup policy (e.g., older than a specified timestamp or before a specific version number) are marked for removal, except the latest manifest, which is never removed.
  2. Reference tracking: Files referenced by retained manifests form the working set. Files referenced by old (removed) manifests but not by any retained manifest are candidates for deletion.
  3. Unverified file handling: Files not referenced by any manifest at all are ambiguous: they could be left over from an abandoned transaction or belong to an in-progress operation. By default, these are deleted only if they are at least 7 days old. Setting delete_unverified to true removes them immediately, but this should only be used when no concurrent writers are active.
  4. Tagged version protection: If tagged versions (named references) fall within the cleanup window, the default behavior is to raise an error. This can be overridden by disabling error_if_tagged_old_versions.
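
The tagged version protection described above can be sketched in a few lines. This is a minimal illustration, not Lance's implementation: the function versions_to_remove, the exception TaggedVersionError, and the in-memory tags mapping are hypothetical names; only the flag name error_if_tagged_old_versions comes from the text. The sketch also assumes that, with the check disabled, tagged versions are skipped rather than removed.

```python
from typing import Dict, Iterable, Set


class TaggedVersionError(Exception):
    """Hypothetical error: a tagged version falls inside the cleanup window."""


def versions_to_remove(
    candidates: Iterable[int],
    tags: Dict[str, int],
    error_if_tagged_old_versions: bool = True,
) -> Set[int]:
    """Return the candidate versions that may be removed.

    candidates: versions already selected by the cleanup policy
    tags:       tag name -> tagged version number
    """
    candidates = set(candidates)
    tagged = candidates & set(tags.values())
    if tagged and error_if_tagged_old_versions:
        raise TaggedVersionError(
            f"tagged versions in cleanup window: {sorted(tagged)}"
        )
    # Assumption: with the check disabled, tagged versions are skipped.
    return candidates - tagged
```

Failing loudly by default forces the operator to decide explicitly what should happen to a named checkpoint before any of its files are touched.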

The cleanup can be configured through the CleanupPolicy builder which supports timestamp-based cleanup, version-count-based retention, and the unverified file handling options. An automatic cleanup hook (auto_cleanup_hook) can be configured through dataset configuration keys to run cleanup periodically after commits.
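
To make the two retention modes concrete, here is a small sketch of how a policy might select versions for removal. It does not use the Lance API: RetentionPolicy, its fields before_timestamp and retain_n_versions, and select_old_versions are hypothetical stand-ins, and the rule for combining both modes (a version must fail both tests to be removed) is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical stand-in for a cleanup policy; not the Lance API."""
    before_timestamp: Optional[datetime] = None  # clean versions older than this
    retain_n_versions: Optional[int] = None      # always keep the newest N
    delete_unverified: bool = False


def select_old_versions(
    manifests: List[Tuple[int, datetime]], policy: RetentionPolicy
) -> List[int]:
    """Return the versions the policy marks for removal, never the latest."""
    ordered = sorted(manifests)  # ascending by version number
    n = policy.retain_n_versions
    if not ordered or (policy.before_timestamp is None and n is None):
        return []  # nothing stored, or an empty policy: clean nothing
    latest = ordered[-1][0]
    keep_newest = {v for v, _ in ordered[-n:]} if n else set()
    old = []
    for version, committed_at in ordered:
        if version == latest or version in keep_newest:
            continue
        if (policy.before_timestamp is not None
                and committed_at >= policy.before_timestamp):
            continue  # too recent for timestamp-based cleanup
        old.append(version)
    return old


def day(d: int) -> datetime:
    return datetime(2026, 1, d, tzinfo=timezone.utc)


history = [(1, day(1)), (2, day(2)), (3, day(3)), (4, day(4))]
by_time = select_old_versions(history, RetentionPolicy(before_timestamp=day(3)))
by_count = select_old_versions(history, RetentionPolicy(retain_n_versions=2))
```

With this four-version history, both policies select versions 1 and 2: the first because they predate the cutoff, the second because they fall outside the newest two.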

Usage

Use version cleanup:

  • Periodically in a maintenance job to reclaim storage space.
  • After a series of compaction operations that generated many intermediate versions.
  • When storage costs are a concern and old versions are no longer needed for time-travel queries.
  • Through the auto-cleanup hook for hands-off maintenance.

Theoretical Basis

Version cleanup is a form of garbage collection over an immutable, versioned file system:

# Python-style pseudocode: policy, files_in, keep, and delete are placeholders.
retained_manifests = {m for m in all_manifests if m == latest or not policy.should_clean(m)}
old_manifests = all_manifests - retained_manifests

referenced_files = set().union(*(files_in(m) for m in retained_manifests))
verified_files = set().union(*(files_in(m) for m in all_manifests))

for file in storage:
    if file in referenced_files:
        keep(file)    # part of the working set
    elif file in verified_files:
        delete(file)  # referenced only by old versions
    elif delete_unverified or file.age_days >= 7:
        delete(file)  # unverified and old enough, likely abandoned
    else:
        keep(file)    # possibly part of an in-progress operation

for manifest in old_manifests:
    delete(manifest)
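
The classification above can be run against a small in-memory example. The following is a sketch under stated assumptions, not Lance's code: plan_file_cleanup, its parameters, and the file names are hypothetical, and manifests are modeled as plain sets of file paths. The 7-day threshold and the delete_unverified flag come from the text.

```python
from datetime import timedelta
from typing import Dict, Set

UNVERIFIED_THRESHOLD = timedelta(days=7)  # default stated in the text


def plan_file_cleanup(
    manifest_files: Dict[int, Set[str]],
    old_versions: Set[int],
    storage_ages: Dict[str, timedelta],
    delete_unverified: bool = False,
) -> Set[str]:
    """Return the set of files to delete.

    manifest_files: version -> files referenced by that version's manifest
    old_versions:   versions already selected for removal (never the latest)
    storage_ages:   every file found in storage -> its age
    """
    referenced: Set[str] = set()  # files needed by retained versions
    verified: Set[str] = set()    # files referenced by any manifest at all
    for version, files in manifest_files.items():
        verified |= files
        if version not in old_versions:
            referenced |= files

    to_delete: Set[str] = set()
    for path, age in storage_ages.items():
        if path in referenced:
            continue  # part of the working set
        if path in verified:
            to_delete.add(path)  # referenced only by old versions
        elif delete_unverified or age >= UNVERIFIED_THRESHOLD:
            to_delete.add(path)  # unverified and old enough, likely abandoned
        # otherwise keep: possibly part of an in-progress operation
    return to_delete


manifests = {
    1: {"data/a.lance"},
    2: {"data/a.lance", "data/b.lance"},
    3: {"data/b.lance", "data/c.lance"},
}
storage = {
    "data/a.lance": timedelta(days=30),
    "data/b.lance": timedelta(days=20),
    "data/c.lance": timedelta(days=1),
    "data/orphan_old.lance": timedelta(days=10),
    "data/orphan_new.lance": timedelta(hours=2),
}
doomed = plan_file_cleanup(manifests, old_versions={1, 2}, storage_ages=storage)
```

Here data/b.lance survives even though old versions reference it, because version 3 still needs it, while the two-hour-old orphan is kept in case a writer is still about to commit it.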

Key safety invariants:

  • Latest version protection: The current version is never removed, ensuring the dataset always remains readable.
  • Conservative unverified handling: The 7-day default threshold provides a safety margin against deleting files from long-running concurrent operations.
  • Tag awareness: Tagged versions serve as named checkpoints; by default, cleanup raises an error when a tagged version falls within the cleanup window instead of removing it silently.
  • Idempotency: Running cleanup multiple times produces the same result; files already deleted are simply not found.
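
The idempotency invariant can be illustrated with a delete routine that tolerates missing files. This sketch uses a temporary local directory as a stand-in for dataset storage; delete_files and the file names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def delete_files(paths: Iterable[Path]) -> int:
    """Delete each path, ignoring files that are already gone.

    Returns the number of files actually removed by this call.
    """
    removed = 0
    for path in paths:
        try:
            path.unlink()
            removed += 1
        except FileNotFoundError:
            pass  # already deleted by an earlier run
    return removed


with tempfile.TemporaryDirectory() as root:
    targets = [Path(root) / name for name in ("old-1.manifest", "old-2.manifest")]
    for target in targets:
        target.write_bytes(b"")
    first_run = delete_files(targets)   # removes both files
    second_run = delete_files(targets)  # finds nothing left to remove
    remaining = sorted(p.name for p in Path(root).iterdir())
```

The second call does no work and raises no error, so a cleanup job that is interrupted partway through can be restarted without special handling.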

The file types subject to cleanup include: manifest files (_versions/), data files (data/), deletion files (_deletions/), index files (_indices/), and transaction files (_transactions/).
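
As a small illustration of this layout, a cleanup routine might classify each stored path by its directory prefix. The prefixes below are the ones listed above; the classify helper and the example file names are hypothetical.

```python
from typing import Optional

# Directory prefixes listed in the text, mapped to the file type they hold.
CLEANUP_PREFIXES = {
    "_versions/": "manifest",
    "data/": "data",
    "_deletions/": "deletion",
    "_indices/": "index",
    "_transactions/": "transaction",
}


def classify(relative_path: str) -> Optional[str]:
    """Return the file type for a dataset-relative path, or None if unknown."""
    for prefix, kind in CLEANUP_PREFIXES.items():
        if relative_path.startswith(prefix):
            return kind
    return None  # not a file type that cleanup manages
```

Returning None for unrecognized paths lets a caller leave unrelated files untouched.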
