Implementation:Lance format Lance Cleanup Old Versions

From Leeroopedia
Revision as of 15:26, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Lance_format_Lance_Cleanup_Old_Versions.md)


Knowledge Sources
Domains Data_Engineering, Storage_Optimization
Last Updated 2026-02-08 19:00 GMT

Overview

Concrete tool for removing old dataset versions and their unreferenced files from storage to reclaim space, provided by the Lance library.

Description

Dataset::cleanup_old_versions constructs a CleanupPolicy from its arguments and delegates to cleanup_with_policy, which in turn calls the internal cleanup_old_versions function in the cleanup module. The cleanup task:

  1. Iterates all manifest files in the _versions/ directory.
  2. Applies the cleanup policy to determine which manifests are old and should be removed (the latest manifest is always retained).
  3. Tracks all files referenced by retained manifests (working set) and all files referenced by any manifest (verified set).
  4. Scans storage for data files, deletion files, index files, and transaction files.
  5. Deletes files that are not in the working set but are in the verified set (referenced only by old versions).
  6. Handles unverified files (not referenced by any manifest) based on the delete_unverified setting and the 7-day age threshold.
  7. Removes the old manifest files themselves.
  8. Returns RemovalStats with the total bytes removed and count of old versions cleaned.
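
The per-file decision in steps 3 to 6 can be sketched with plain sets. This is a minimal illustration, not Lance's internal code: the `Decision` enum, the `classify` helper, and the `UNVERIFIED_THRESHOLD` constant are hypothetical names, and treating "older than 7 days" as a strict comparison is an assumption.

```rust
use std::collections::HashSet;
use std::time::Duration;

/// Outcome of the cleanup pass for one file found in storage.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Keep,
    Delete,
}

/// Age beyond which an unverified file is considered safe to delete (step 6).
pub const UNVERIFIED_THRESHOLD: Duration = Duration::from_secs(7 * 24 * 60 * 60);

/// Hypothetical helper: classify one file path against the two sets.
/// `working_set` holds files referenced by retained manifests,
/// `verified_set` holds files referenced by any manifest.
pub fn classify(
    path: &str,
    working_set: &HashSet<String>,
    verified_set: &HashSet<String>,
    age: Duration,
    delete_unverified: bool,
) -> Decision {
    if working_set.contains(path) {
        // Still referenced by a retained manifest: always keep.
        Decision::Keep
    } else if verified_set.contains(path) {
        // Referenced only by old manifests: delete (step 5).
        Decision::Delete
    } else if delete_unverified || age > UNVERIFIED_THRESHOLD {
        // Unverified: delete only when forced or older than the threshold.
        Decision::Delete
    } else {
        // Recent unverified file, possibly part of an in-progress write.
        Decision::Keep
    }
}
```

Keeping recent unverified files by default is what protects concurrent writers whose files are not yet referenced by a committed manifest.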

cleanup_with_policy can also be called directly with a pre-built CleanupPolicy. This allows more flexible configuration, including version-count-based retention.

The auto_cleanup_hook function enables automatic cleanup triggered every N versions, configured through dataset config keys lance.auto_cleanup.interval and lance.auto_cleanup.older_than.
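
The "every N versions" trigger can be sketched as a modulo check on the committed version number. The `should_auto_cleanup` helper below is hypothetical, and two behaviors are assumptions rather than documented Lance semantics: an interval of 0 disables the hook, and an unparsable value is silently ignored (the real hook may return an error instead).

```rust
use std::collections::HashMap;

/// Hypothetical helper: decide whether the commit that produced `version`
/// should trigger automatic cleanup, based on the dataset config map.
pub fn should_auto_cleanup(config: &HashMap<String, String>, version: u64) -> bool {
    // Read the interval from the config key named in the text above.
    let interval = config
        .get("lance.auto_cleanup.interval")
        .and_then(|raw| raw.parse::<u64>().ok());

    match interval {
        // Assumption: a positive interval fires on every N-th version.
        Some(n) if n > 0 => version % n == 0,
        // Missing, unparsable, or zero interval: never trigger.
        _ => false,
    }
}
```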

Usage

Call cleanup_old_versions periodically or after batch operations to reclaim storage. For more control, build a CleanupPolicy using CleanupPolicyBuilder and call cleanup_with_policy.

Code Reference

Source Location

  • Repository: Lance
  • File: rust/lance/src/dataset.rs (L1170-L1186)
  • File: rust/lance/src/dataset/cleanup.rs (L641-L647, L548-L557, L77-L80)

Signature

impl Dataset {
    pub fn cleanup_old_versions(
        &self,
        older_than: Duration,
        delete_unverified: Option<bool>,
        error_if_tagged_old_versions: Option<bool>,
    ) -> BoxFuture<'_, Result<RemovalStats>>

    pub fn cleanup_with_policy(
        &self,
        policy: CleanupPolicy,
    ) -> BoxFuture<'_, Result<RemovalStats>>
}

Import

use lance::Dataset;
use lance::dataset::cleanup::{CleanupPolicy, CleanupPolicyBuilder, RemovalStats};
use chrono::Duration;

I/O Contract

Inputs

cleanup_old_versions parameters:

Name | Type | Required | Description
self | &Dataset | Yes | Reference to the dataset to clean up.
older_than | Duration | Yes | Remove versions older than this duration from the current time.
delete_unverified | Option<bool> | No | If Some(true), delete files not referenced by any manifest even if they are recent. Default behavior (None/Some(false)) only deletes unverified files older than 7 days.
error_if_tagged_old_versions | Option<bool> | No | If Some(true), return an error if tagged versions fall within the cleanup window. Defaults to true.
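
How the arguments map onto policy values can be sketched with the standard library alone. The `ResolvedArgs` struct and `resolve_args` function are hypothetical stand-ins, and `std::time` types replace the chrono types used by Lance so that the sketch has no dependencies. Only the defaults (false and true) and the "current time minus older_than" cutoff come from the table above.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical stand-in for the policy values derived from the arguments.
#[derive(Debug, PartialEq)]
pub struct ResolvedArgs {
    pub before_timestamp: SystemTime,
    pub delete_unverified: bool,
    pub error_if_tagged_old_versions: bool,
}

/// Resolve the optional flags to their documented defaults and turn the
/// relative `older_than` duration into an absolute cutoff.
pub fn resolve_args(
    now: SystemTime,
    older_than: Duration,
    delete_unverified: Option<bool>,
    error_if_tagged_old_versions: Option<bool>,
) -> ResolvedArgs {
    ResolvedArgs {
        // Versions created before this instant are cleanup candidates.
        before_timestamp: now - older_than,
        // None behaves like Some(false).
        delete_unverified: delete_unverified.unwrap_or(false),
        // None behaves like Some(true).
        error_if_tagged_old_versions: error_if_tagged_old_versions.unwrap_or(true),
    }
}
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the cutoff computation deterministic and easy to test.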

CleanupPolicy fields:

Field | Type | Default | Description
before_timestamp | Option<DateTime<Utc>> | None | Clean all versions before this timestamp.
before_version | Option<u64> | None | Clean all versions before this version number.
delete_unverified | bool | false | Delete unverified files regardless of age.
error_if_tagged_old_versions | bool | true | Error if tagged versions would be cleaned.

CleanupPolicyBuilder methods:

Method | Description
before_timestamp(timestamp) | Set the timestamp cutoff for cleanup.
retain_n_versions(dataset, n) | Keep only the last N versions (async, queries version list).
delete_unverified(bool) | Control unverified file deletion.
error_if_tagged_old_versions(bool) | Control tagged version error behavior.
build() | Produce the final CleanupPolicy.
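
Version-count retention reduces to picking a before_version cutoff from the version list. The sketch below shows one way that cutoff could be derived. `before_version_for_retain` is a hypothetical helper, not the builder's actual code, and clamping n to at least 1 is an assumption based on the latest manifest always being retained.

```rust
/// Hypothetical helper: given version numbers in ascending order, return the
/// cutoff such that removing every version *before* it leaves the newest `n`.
/// Returns None when there is nothing to remove.
pub fn before_version_for_retain(versions: &[u64], n: usize) -> Option<u64> {
    // Assumption: the latest version is always kept, so retain at least one.
    let keep = n.max(1);
    if versions.len() <= keep {
        return None;
    }
    // The oldest version that survives becomes the cutoff.
    Some(versions[versions.len() - keep])
}
```

For versions 1 through 8 and n = 5, the cutoff is 4: versions 1 to 3 become cleanup candidates and versions 4 to 8 are retained.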

Outputs

Name | Type | Description
RemovalStats | struct | Statistics about the cleanup operation.

RemovalStats fields:

Field | Type | Description
bytes_removed | u64 | Total bytes of storage reclaimed.
old_versions | u64 | Number of old versions removed.

Usage Examples

use lance::Dataset;
use chrono::Duration;

async fn cleanup_example(dataset: &Dataset) -> lance::Result<()> {
    // Remove versions older than 7 days
    let stats = dataset
        .cleanup_old_versions(Duration::days(7), None, None)
        .await?;

    println!(
        "Cleaned {} old versions, reclaimed {} bytes",
        stats.old_versions, stats.bytes_removed
    );

    Ok(())
}

// Using CleanupPolicyBuilder for more control:
use lance::dataset::cleanup::CleanupPolicyBuilder;

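async fn advanced_cleanup(dataset: &Dataset) -> lance::Result<()> {
    // Keep only the 5 most recent versions. retain_n_versions is async
    // because it queries the dataset's version list.
    let policy = CleanupPolicyBuilder::default()
        .retain_n_versions(dataset, 5)
        .await?
        // Only delete unverified files older than 7 days (the default).
        .delete_unverified(false)
        // Do not return an error when tagged versions fall in the cleanup window.
        .error_if_tagged_old_versions(false)
        .build();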

    let stats = dataset.cleanup_with_policy(policy).await?;

    println!(
        "Cleaned {} old versions, reclaimed {} bytes",
        stats.old_versions, stats.bytes_removed
    );

    Ok(())
}

Related Pages

Implements Principle
