Principle:Treeverse LakeFS GC Result Verification
| Knowledge Sources | |
|---|---|
| Domains | Storage_Management, Data_Lifecycle |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
GC result verification confirms that expired objects were successfully deleted from the underlying storage while retained objects remain accessible, ensuring garbage collection correctness.
Description
After a garbage collection run completes, operators must verify two complementary properties:
- Expired objects were actually deleted: Objects that should have been garbage collected must no longer be accessible in the underlying object storage. If they remain, the GC job failed or was only partially successful, and storage costs continue to accumulate unnecessarily.
- Retained objects are still accessible: Objects that are within their retention window must still be readable. If they are missing, the GC job was overly aggressive and has caused data loss, which is a critical failure mode.
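The two properties above can be sketched as a single check over two lists of paths. This is a minimal illustration, not lakeFS code: `verify_gc`, `GCVerificationReport`, and the `exists` callable are hypothetical names, and `exists` stands in for whatever probe the operator uses to decide whether an object is still accessible.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class GCVerificationReport:
    # Expired objects that are still accessible: GC failed or was partial.
    not_deleted: List[str] = field(default_factory=list)
    # Retained objects that are no longer accessible: data loss.
    lost: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.not_deleted and not self.lost


def verify_gc(expired: Iterable[str],
              retained: Iterable[str],
              exists: Callable[[str], bool]) -> GCVerificationReport:
    """Check both properties; `exists` is a caller-supplied (hypothetical) probe."""
    report = GCVerificationReport()
    for path in expired:
        if exists(path):
            report.not_deleted.append(path)
    for path in retained:
        if not exists(path):
            report.lost.append(path)
    return report
```

Reporting the two failure kinds separately keeps the cost problem (objects not deleted) distinct from the far more serious data-loss case.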
Verification uses two complementary approaches:
- Logical verification: List objects via the lakeFS API to check whether they appear in the repository's namespace. This checks that lakeFS metadata reports the expected presence or absence of each object.
- Physical verification: Request presigned URLs for the objects and issue HTTP GET requests against those URLs. This confirms that the actual bytes exist (or do not exist) in the underlying object storage, independent of lakeFS's metadata.
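A rough sketch of the two probes, using only the Python standard library. It assumes the listing and the presigned URL have already been obtained from lakeFS by some other means; the function names are hypothetical. It also assumes the storage backend answers 404 for a missing object, which does not hold for every configuration (S3, for example, can answer 403 when the signer lacks list permission), so anything other than 2xx or 404 is treated as inconclusive.

```python
import urllib.error
import urllib.request
from typing import Iterable, Optional


def logically_present(path: str, listed_paths: Iterable[str]) -> bool:
    """Logical check: does the path appear in a listing returned by lakeFS?"""
    return path in set(listed_paths)


def fetch_status(presigned_url: str, timeout: float = 10.0) -> int:
    """Issue an HTTP GET against a presigned URL and return the status code."""
    request = urllib.request.Request(presigned_url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still the answer we need.
        return err.code


def physically_present(status: int) -> Optional[bool]:
    """Map a status code to present (True), absent (False), or inconclusive (None)."""
    if 200 <= status < 300:
        return True
    if status == 404:  # assumption: the backend reports missing objects as 404
        return False
    # e.g. 403 (expired signature, missing permission) or 5xx (throttling)
    return None
```

Network-level failures (`urllib.error.URLError`) are deliberately left to propagate: a probe that could not reach storage says nothing about whether the object exists.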
The combination of logical and physical verification is important because:
- Logical verification alone cannot detect cases where metadata says an object exists but the physical bytes are missing (a phantom reference).
- Physical verification alone cannot detect cases where the bytes are in the expected state but lakeFS metadata disagrees with them.
Usage
Perform GC result verification:
- After every production GC run, as part of the automated pipeline
- After the first GC run on a new repository, to build confidence in the configuration
- When debugging unexpected storage cost trends (objects not being deleted)
- When investigating data access failures that may be caused by incorrect GC
Theoretical Basis
GC result verification is an application of the trust-but-verify principle in distributed systems. Even when the GC pipeline is well-tested, production environments introduce variables (network partitions, storage throttling, partial failures, concurrent writes) that can cause unexpected outcomes.
The verification approach follows a two-oracle testing pattern:
| Oracle | Source of Truth | What It Checks |
|---|---|---|
| Logical oracle | lakeFS API (listObjects) | Object appears/does not appear in the repository's metadata |
| Physical oracle | Object storage (presigned URL HTTP GET) | Object bytes exist/do not exist in the storage backend |
When both oracles agree, confidence is high. When they disagree, the discrepancy indicates a bug or partial failure that requires investigation.
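The agree/disagree logic can be written out as a small decision function. This is an illustrative sketch: the `Verdict` names are labels chosen here (only "phantom reference" comes from the text above), and the three booleans are assumed to come from the expected state and the two oracles respectively.

```python
from enum import Enum


class Verdict(Enum):
    OK = "both oracles match the expected state"
    NOT_DELETED = "expired object still present in metadata and storage"
    DATA_LOSS = "retained object missing from metadata and storage"
    PHANTOM_REFERENCE = "metadata present, bytes missing"
    ORPHANED_BYTES = "metadata absent, bytes present"  # label is an assumption


def classify(should_exist: bool, logical: bool, physical: bool) -> Verdict:
    """Combine the expected state with the logical and physical oracle results."""
    if logical != physical:
        # The oracles disagree: a bug or partial failure, whatever was expected.
        return Verdict.PHANTOM_REFERENCE if logical else Verdict.ORPHANED_BYTES
    if logical == should_exist:
        return Verdict.OK
    # The oracles agree with each other but not with the expectation.
    return Verdict.DATA_LOSS if should_exist else Verdict.NOT_DELETED
```

Checking for disagreement first means an inconsistent object is never reported as a plain success or a plain GC failure.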
The use of presigned URLs for physical verification is architecturally significant: it allows the verification client to check object existence without needing direct credentials to the underlying storage. The lakeFS server generates time-limited, scoped URLs that grant read-only access to specific objects.
The verification pattern also follows the postcondition checking discipline from formal methods: after executing an operation (GC), explicitly check that the expected postconditions (deleted objects absent, retained objects present) hold true.
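As a generic illustration of postcondition checking, the wrapper below runs an operation and then evaluates named predicates. It is a sketch under stated assumptions: `run_with_postconditions` and `PostconditionError` are hypothetical names, and the operation passed in would be whatever launches the GC job in a given pipeline.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class PostconditionError(AssertionError):
    """Raised when an operation finishes but an expected postcondition fails."""


def run_with_postconditions(operation: Callable[[], T],
                            postconditions: Dict[str, Callable[[], bool]]) -> T:
    """Run `operation`, then evaluate every named postcondition predicate."""
    result = operation()
    # Evaluate all predicates so the error lists every violation, not just the first.
    failed = [name for name, holds in postconditions.items() if not holds()]
    if failed:
        raise PostconditionError("postconditions violated: " + ", ".join(failed))
    return result
```

In an automated pipeline, an uncaught `PostconditionError` would fail the job, which makes an incorrect GC run visible immediately rather than at the next read.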