Principle:Treeverse LakeFS GC Result Verification

From Leeroopedia


Knowledge Sources
Domains: Storage_Management, Data_Lifecycle
Last Updated: 2026-02-08 00:00 GMT

Overview

GC result verification confirms that expired objects were successfully deleted from the underlying storage while retained objects remain accessible, ensuring garbage collection correctness.

Description

After a garbage collection run completes, operators must verify two complementary properties:

  1. Expired objects were actually deleted: Objects that should have been garbage collected must no longer be accessible in the underlying object storage. If they remain, the GC job failed or was only partially successful, and storage costs continue to accumulate unnecessarily.
  2. Retained objects are still accessible: Objects that are within their retention window must still be readable. If they are missing, the GC job was overly aggressive and has caused data loss, which is a critical failure mode.

Verification uses two complementary approaches:

Logical verification — List objects via the lakeFS API to check whether they appear in the repository's namespace. This confirms that lakeFS's metadata about object existence is consistent.

Physical verification — Request presigned URLs for the objects and issue HTTP GET requests against those URLs. This confirms that the actual bytes exist (or do not exist) in the underlying object storage, independent of lakeFS's metadata.
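The physical check can be sketched in a few lines of Python. This is a minimal illustration, not lakeFS's own tooling: it assumes the presigned URL has already been obtained from lakeFS, and the function name `physical_exists` and its `opener` parameter are hypothetical. It also assumes the storage backend answers HTTP 404 for a missing object; some backends return 403 instead, depending on bucket permissions, so any other error is re-raised rather than guessed at.

```python
import urllib.error
import urllib.request


def physical_exists(presigned_url, opener=urllib.request.urlopen, timeout=10):
    """Return True if a GET on the presigned URL succeeds, False on HTTP 404.

    `opener` is injectable so the check can be exercised without a network.
    Any HTTP error other than 404 is re-raised for the caller to inspect.
    """
    try:
        with opener(presigned_url, timeout=timeout) as response:
            # urlopen raises HTTPError for 4xx/5xx, so reaching this point
            # normally means a 2xx response.
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # assumed to mean: bytes are absent from storage
        raise
```

Re-raising unexpected status codes keeps an expired URL or a permissions problem from being misread as a successful deletion.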

The combination of logical and physical verification is important because:

  • Logical verification alone cannot detect cases where metadata says an object exists but the physical bytes are missing (a phantom reference).
  • Physical verification alone cannot detect cases where lakeFS metadata is out of sync with storage state, such as bytes that still exist for an object the repository no longer lists.

Usage

Perform GC result verification:

  • After every production GC run, as part of the automated pipeline
  • After the first GC run on a new repository, to build confidence in the configuration
  • When debugging unexpected storage cost trends (objects not being deleted)
  • When investigating data access failures that may be caused by incorrect GC

Theoretical Basis

GC result verification is an application of the "trust, but verify" principle in distributed systems. Even when the GC pipeline is well-tested, production environments introduce variables (network partitions, storage throttling, partial failures, concurrent writes) that can cause unexpected outcomes.

The verification approach follows the two-oracle testing pattern:

Oracle | Source of Truth | What It Checks
Logical oracle | lakeFS API (listObjects) | Object appears/does not appear in the repository's metadata
Physical oracle | Object storage (presigned URL HTTP GET) | Object bytes exist/do not exist in the storage backend

When both oracles agree, confidence is high. When they disagree, the discrepancy indicates a bug or partial failure that requires investigation.
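The agreement rule can be written down as a small decision function. The sketch below is illustrative only: the function names and the verdict labels are hypothetical, and the two booleans are assumed to come from a listing call and a presigned-URL GET performed elsewhere.

```python
def classify_oracles(listed: bool, bytes_exist: bool) -> str:
    """Combine the logical and physical oracle results into one verdict."""
    if listed and bytes_exist:
        return "present"            # both oracles agree: object exists
    if not listed and not bytes_exist:
        return "absent"             # both oracles agree: object is gone
    if listed and not bytes_exist:
        return "phantom-reference"  # metadata says yes, storage says no
    return "unlisted-bytes"         # storage says yes, metadata says no


def needs_investigation(verdict: str) -> bool:
    """A disagreement between the oracles always warrants a closer look."""
    return verdict in {"phantom-reference", "unlisted-bytes"}
```

For an object that GC was expected to delete, "absent" is the passing verdict; for a retained object, "present" is.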

The use of presigned URLs for physical verification is architecturally significant: it allows the verification client to check object existence without needing direct credentials to the underlying storage. The lakeFS server generates time-limited, scoped URLs that grant read-only access to specific objects.

The verification pattern also follows the postcondition checking discipline from formal methods: after executing an operation (GC), explicitly check that the expected postconditions (deleted objects absent, retained objects present) hold true.
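As a rough sketch of that discipline, the postconditions can be checked by walking the two sets of paths and collecting every violation. The function name, the message strings, and the `exists` callback are assumptions made for illustration; `exists` stands in for whichever oracle (logical, physical, or both) the operator chooses.

```python
from typing import Callable, Iterable, List, Tuple


def check_gc_postconditions(
    expected_deleted: Iterable[str],
    expected_retained: Iterable[str],
    exists: Callable[[str], bool],
) -> List[Tuple[str, str]]:
    """Return (path, problem) pairs for every violated GC postcondition."""
    violations = []
    for path in sorted(expected_deleted):
        if exists(path):
            # GC failed or was only partially successful.
            violations.append((path, "expired object still present"))
    for path in sorted(expected_retained):
        if not exists(path):
            # GC was overly aggressive: potential data loss.
            violations.append((path, "retained object missing"))
    return violations
```

An empty result means both postconditions hold; in an automated pipeline, a non-empty result would typically fail the run.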
