Implementation:Lance format Lance Cleanup Old Versions
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Storage_Optimization |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Concrete tool for removing old dataset versions and their unreferenced files from storage to reclaim space, provided by the Lance library.
Description
Dataset::cleanup_old_versions constructs a CleanupPolicy from its arguments and delegates to cleanup_with_policy, which in turn calls the internal cleanup_old_versions function in the cleanup module. The cleanup task:
- Iterates all manifest files in the _versions/ directory.
- Applies the cleanup policy to determine which manifests are old and should be removed (the latest manifest is always retained).
- Tracks all files referenced by retained manifests (working set) and all files referenced by any manifest (verified set).
- Scans storage for data files, deletion files, index files, and transaction files.
- Deletes files that are not in the working set but are in the verified set (referenced only by old versions).
- Handles unverified files (not referenced by any manifest) based on the delete_unverified setting and the 7-day age threshold.
- Removes the old manifest files themselves.
- Returns RemovalStats with the total bytes removed and count of old versions cleaned.
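The per-file decision described in the steps above can be sketched as a small standalone model. This is not the code in the cleanup module: StoredFile and should_delete are hypothetical names, the sets hold plain path strings, and the strict "older than 7 days" comparison is an assumption drawn from the wording above.

```rust
use std::collections::HashSet;
use std::time::Duration;

/// Age after which an unverified file is removed even without `delete_unverified`.
const UNVERIFIED_THRESHOLD: Duration = Duration::from_secs(7 * 24 * 60 * 60);

/// A file found while scanning storage (hypothetical model type).
struct StoredFile<'a> {
    path: &'a str,
    age: Duration,
}

/// Decide whether a scanned file should be deleted.
fn should_delete(
    file: &StoredFile,
    working_set: &HashSet<&str>,  // files referenced by a retained manifest
    verified_set: &HashSet<&str>, // files referenced by any manifest
    delete_unverified: bool,
) -> bool {
    if working_set.contains(file.path) {
        // Still needed by a retained version.
        false
    } else if verified_set.contains(file.path) {
        // Referenced only by old versions.
        true
    } else {
        // Unverified: may belong to an in-progress write, so only remove it
        // when explicitly requested or when it is older than the threshold.
        delete_unverified || file.age > UNVERIFIED_THRESHOLD
    }
}
```

Under this model, a one-day-old file that no manifest references survives a default cleanup and is removed only when delete_unverified is true or once it is more than 7 days old.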
An alternative entry point, cleanup_with_policy, accepts a pre-built CleanupPolicy for more flexible configuration including version-count-based retention.
The auto_cleanup_hook function enables automatic cleanup triggered every N versions, configured through dataset config keys lance.auto_cleanup.interval and lance.auto_cleanup.older_than.
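As a rough illustration of the "every N versions" trigger, the sketch below models the dataset config as a plain string map. The function auto_cleanup_due is hypothetical, the rule that cleanup fires when the version number is a multiple of the interval is an assumption, and the older_than value is passed through as an opaque string because its format is not specified here.

```rust
use std::collections::HashMap;

/// Config keys named in the text above.
const INTERVAL_KEY: &str = "lance.auto_cleanup.interval";
const OLDER_THAN_KEY: &str = "lance.auto_cleanup.older_than";

/// Returns the raw `older_than` setting when cleanup should run for `version`,
/// or `None` when auto cleanup is not configured or not due.
/// Assumption: cleanup fires when the version is a multiple of the interval.
fn auto_cleanup_due<'a>(config: &'a HashMap<String, String>, version: u64) -> Option<&'a str> {
    // A missing or unparsable interval disables auto cleanup in this sketch.
    let interval: u64 = config.get(INTERVAL_KEY)?.parse().ok()?;
    if interval == 0 || version % interval != 0 {
        return None;
    }
    config.get(OLDER_THAN_KEY).map(String::as_str)
}
```

With an interval of 20, this model reports cleanup as due at versions 20, 40, 60, and so on, and skips every version in between.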
Usage
Call cleanup_old_versions periodically or after batch operations to reclaim storage. For more control, build a CleanupPolicy using CleanupPolicyBuilder and call cleanup_with_policy.
Code Reference
Source Location
- Repository: Lance
- File: rust/lance/src/dataset.rs (L1170-L1186), rust/lance/src/dataset/cleanup.rs (L641-L647, L548-L557, L77-L80)
- Lines: See above
Signature
impl Dataset {
    pub fn cleanup_old_versions(
        &self,
        older_than: Duration,
        delete_unverified: Option<bool>,
        error_if_tagged_old_versions: Option<bool>,
    ) -> BoxFuture<'_, Result<RemovalStats>>

    pub fn cleanup_with_policy(
        &self,
        policy: CleanupPolicy,
    ) -> BoxFuture<'_, Result<RemovalStats>>
}
Import
use lance::Dataset;
use lance::dataset::cleanup::{CleanupPolicy, CleanupPolicyBuilder, RemovalStats};
use chrono::Duration;
I/O Contract
Inputs
cleanup_old_versions parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| self | &Dataset | Yes | Reference to the dataset to clean up. |
| older_than | Duration | Yes | Remove versions older than this duration from the current time. |
| delete_unverified | Option<bool> | No | If Some(true), delete files not referenced by any manifest even if they are recent. Default behavior (None/Some(false)) only deletes unverified files older than 7 days. |
| error_if_tagged_old_versions | Option<bool> | No | If Some(true), return an error if tagged versions fall within the cleanup window. Defaults to true. |
CleanupPolicy fields:
| Field | Type | Default | Description |
|---|---|---|---|
| before_timestamp | Option<DateTime<Utc>> | None | Clean all versions before this timestamp. |
| before_version | Option<u64> | None | Clean all versions before this version number. |
| delete_unverified | bool | false | Delete unverified files regardless of age. |
| error_if_tagged_old_versions | bool | true | Error if tagged versions would be cleaned. |
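To make the two cutoff fields concrete, here is a minimal standalone sketch of how a policy might select old versions. PolicySketch, policy_from_older_than, and is_old are hypothetical names, std::time stands in for chrono, and combining the two cutoffs with a logical OR is an assumption rather than confirmed library behavior. Tag handling is left out.

```rust
use std::time::{Duration, SystemTime};

/// Simplified stand-in for the policy fields that select versions.
struct PolicySketch {
    before_timestamp: Option<SystemTime>,
    before_version: Option<u64>,
}

/// Derive the cutoff the way `older_than` is described: now minus the duration.
fn policy_from_older_than(now: SystemTime, older_than: Duration) -> PolicySketch {
    PolicySketch {
        before_timestamp: now.checked_sub(older_than),
        before_version: None,
    }
}

/// Assumption: a version is old if it matches either configured cutoff.
/// The latest version is never considered old.
fn is_old(policy: &PolicySketch, version: u64, written_at: SystemTime, latest: u64) -> bool {
    if version == latest {
        return false;
    }
    let by_time = policy.before_timestamp.map_or(false, |t| written_at < t);
    let by_version = policy.before_version.map_or(false, |v| version < v);
    by_time || by_version
}
```

With both fields left as None, nothing is selected in this model, which matches the table's defaults of a policy that removes no versions until a cutoff is set.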
CleanupPolicyBuilder methods:
| Method | Description |
|---|---|
| before_timestamp(timestamp) | Set the timestamp cutoff for cleanup. |
| retain_n_versions(dataset, n) | Keep only the last N versions (async, queries version list). |
| delete_unverified(bool) | Control unverified file deletion. |
| error_if_tagged_old_versions(bool) | Control tagged version error behavior. |
| build() | Produce the final CleanupPolicy. |
Outputs
| Name | Type | Description |
|---|---|---|
| RemovalStats | struct | Statistics about the cleanup operation. |
RemovalStats fields:
| Field | Type | Description |
|---|---|---|
| bytes_removed | u64 | Total bytes of storage reclaimed. |
| old_versions | u64 | Number of old versions removed. |
Usage Examples
use lance::Dataset;
use chrono::Duration;

async fn cleanup_example(dataset: &Dataset) -> lance::Result<()> {
    // Remove versions older than 7 days
    let stats = dataset
        .cleanup_old_versions(Duration::days(7), None, None)
        .await?;
    println!(
        "Cleaned {} old versions, reclaimed {} bytes",
        stats.old_versions, stats.bytes_removed
    );
    Ok(())
}

// Using CleanupPolicyBuilder for more control:
use lance::dataset::cleanup::CleanupPolicyBuilder;

async fn advanced_cleanup(dataset: &Dataset) -> lance::Result<()> {
    let policy = CleanupPolicyBuilder::default()
        .retain_n_versions(dataset, 5)
        .await?
        .delete_unverified(false)
        .error_if_tagged_old_versions(false)
        .build();
    let stats = dataset.cleanup_with_policy(policy).await?;
    println!(
        "Cleaned {} old versions, reclaimed {} bytes",
        stats.old_versions, stats.bytes_removed
    );
    Ok(())
}