Principle: Treeverse lakeFS GC Metadata Preparation
| Knowledge Sources | |
|---|---|
| Domains | Storage_Management, Data_Lifecycle |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
GC metadata preparation identifies expired commits and their associated physical addresses, producing structured output files that drive the subsequent garbage collection deletion phase.
Description
Before any objects can be safely deleted from the underlying object storage, the system must determine which objects are still needed and which are eligible for removal. This is the metadata preparation phase of the garbage collection pipeline.
The preparation process works by:
- Enumerating all branches in the repository and their configured retention periods
- Walking the commit graph for each branch to identify commits whose age exceeds the branch's retention window
- Collecting physical addresses referenced exclusively by expired commits (i.e., addresses not referenced by any still-alive commit on any branch)
- Writing the results to structured files (CSV for commits, Parquet for addresses) in a well-known location on the repository's backing object storage
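The retention check at the heart of these steps can be sketched as follows. This is a minimal illustration, not lakeFS's actual data model: the commit records, branch layout, and single-parent chain are assumptions made for brevity (real lakeFS commits form a DAG with merge commits).

```python
from datetime import datetime, timedelta, timezone

# Toy commit graph: id -> (parent id or None, commit timestamp).
NOW = datetime(2026, 2, 8, tzinfo=timezone.utc)
COMMITS = {
    "c3": ("c2", NOW - timedelta(days=1)),
    "c2": ("c1", NOW - timedelta(days=10)),
    "c1": (None, NOW - timedelta(days=40)),
}
BRANCHES = {"main": ("c3", 7)}  # branch -> (head commit, retention days)

def find_expired(commits, branches, now):
    """Walk each branch from its head; commits older than the
    branch's retention cutoff are candidates for expiry."""
    expired, alive = set(), set()
    for head, retention_days in branches.values():
        cutoff = now - timedelta(days=retention_days)
        commit_id = head
        while commit_id is not None:
            parent, ts = commits[commit_id]
            (alive if ts >= cutoff else expired).add(commit_id)
            commit_id = parent
    # A commit considered alive on any branch is never expired.
    return expired - alive, alive

expired, alive = find_expired(COMMITS, BRANCHES, NOW)
print(sorted(expired))  # ['c1', 'c2'] -- outside the 7-day window
```

Note the final set difference: a commit within the retention window of *any* branch stays alive, even if it is expired relative to another branch's window.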
The output of this phase is a self-contained dataset that the downstream Spark GC job can consume independently. This includes:
- A commits CSV listing which commits are alive and which are expired
- A Parquet addresses file listing the physical storage paths of objects eligible for deletion
- A run ID that uniquely identifies this preparation run, used to correlate with the Spark job
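Producing such an output set might look like the sketch below. The two-column CSV layout and the run-ID scheme are illustrative assumptions; the real files follow lakeFS's own schema.

```python
import csv
import io
import uuid

def write_commits_csv(alive, expired):
    """Emit a commits CSV marking each commit alive or expired,
    plus a run ID that correlates this preparation run with the
    downstream Spark job (illustrative layout only)."""
    run_id = uuid.uuid4().hex  # assumed run-ID scheme, for illustration
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["commit_id", "expired"])
    for commit in sorted(alive):
        writer.writerow([commit, "false"])
    for commit in sorted(expired):
        writer.writerow([commit, "true"])
    return run_id, buf.getvalue()

run_id, csv_text = write_commits_csv({"c3"}, {"c1", "c2"})
print(csv_text)
```

In the real pipeline the files land in a well-known location on the repository's backing object storage, keyed by the run ID, so the Spark job can locate them without any further coordination.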
This phase is deliberately read-only with respect to the object store's data objects. It reads metadata (commits, references, manifests) but does not delete anything. Deletion is deferred to the Spark job.
Usage
Trigger metadata preparation when:
- Running the GC pipeline on a schedule (the preparation step always precedes the Spark deletion job)
- Performing a dry run to understand what would be deleted without actually deleting anything
- Debugging GC behavior by inspecting the generated CSV/Parquet files
- Testing retention rule changes before committing to a full GC run
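For the dry-run and debugging cases, the generated commits file can be inspected before anything is deleted. A sketch with only the standard library, assuming an illustrative `commit_id`/`expired` CSV layout rather than lakeFS's actual schema:

```python
import csv
import io

# Illustrative commits CSV as the preparation phase might emit it.
SAMPLE = "commit_id,expired\nc3,false\nc2,true\nc1,true\n"

def summarize(csv_text):
    """Count alive vs. expired commits -- a quick sanity check
    before handing the run over to the Spark deletion job."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    expired = [r["commit_id"] for r in rows if r["expired"] == "true"]
    return {"total": len(rows), "expired": expired}

print(summarize(SAMPLE))  # {'total': 3, 'expired': ['c2', 'c1']}
```

The same inspection scales up with pandas or DuckDB against the real files; the point is that the preparation output is plain data, so a retention-rule change can be validated by diffing two preparation runs.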
Theoretical Basis
The separation of identification from execution in the GC pipeline follows the classic two-phase pattern used in distributed garbage collection systems:
Phase 1 (Mark): Identify live and dead objects by traversing the reference graph. In lakeFS, this is the metadata preparation step, which walks the commit DAG and determines liveness based on retention rules.
Phase 2 (Sweep): Delete the dead objects. In lakeFS, this is delegated to the Spark job, which performs bulk deletion against the object store.
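The mark phase's liveness rule, that an address is deletable only if no live commit references it, reduces to a set difference over the reference graph. A toy sketch with invented commit-to-address mappings:

```python
# Toy mapping: commit -> physical addresses its manifest references
# (illustrative data, not lakeFS's metadata format).
REFS = {
    "c1": {"data/a", "data/b"},
    "c2": {"data/b", "data/c"},
    "c3": {"data/c", "data/d"},
}

def mark(refs, expired):
    """Mark: addresses referenced exclusively by expired commits."""
    live = set().union(*(refs[c] for c in refs if c not in expired))
    dead = set().union(*(refs[c] for c in expired))
    return dead - live

# c1 and c2 are expired, but data/c is still referenced by live c3,
# so only data/a and data/b are eligible for the sweep phase.
print(sorted(mark(REFS, {"c1", "c2"})))  # ['data/a', 'data/b']
```

The sweep phase then only needs the resulting address list, which is why it can run in a separate system (Spark) with no access to the commit graph at all.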
This separation provides several advantages:
| Advantage | Explanation |
|---|---|
| Scalability | The mark phase runs inside the lakeFS server (which has access to the commit graph), while the sweep phase runs in Spark (which can parallelize deletion across hundreds of executors) |
| Safety | The mark phase output can be inspected before the sweep runs, preventing accidental data loss |
| Idempotency | If the Spark job fails partway through, it can be re-run with the same run ID without re-preparing metadata |
| Auditability | The CSV/Parquet files serve as a permanent record of what was marked for deletion and why |
The use of CSV and Parquet as interchange formats is deliberate: these formats are natively supported by Apache Spark and can be easily inspected with standard data tools (e.g., DuckDB, pandas, AWS Athena).