Principle: Treeverse lakeFS GC Metadata Preparation
| Knowledge Sources | |
|---|---|
| Domains | Storage_Management, Data_Lifecycle |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
GC metadata preparation identifies expired commits and their associated physical addresses, producing structured output files that drive the subsequent garbage collection deletion phase.
Description
Before any objects can be safely deleted from the underlying object storage, the system must determine which objects are still needed and which are eligible for removal. This is the metadata preparation phase of the garbage collection pipeline.
The preparation process works by:
- Enumerating all branches in the repository and their configured retention periods
- Walking the commit graph for each branch to identify commits whose age exceeds the branch's retention window
- Collecting physical addresses referenced exclusively by expired commits (i.e., addresses not referenced by any still-alive commit on any branch)
- Writing the results to structured files (CSV for commits, Parquet for addresses) in a well-known location on the repository's backing object storage
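The retention check at the heart of these steps can be sketched as follows. This is a minimal illustration, not lakeFS's actual data model: the commit records, branch layout, and single-parent chain are assumptions made for brevity (real lakeFS commits form a DAG with merge commits).

```python
from datetime import datetime, timedelta, timezone

# Toy commit graph: id -> (parent id or None, commit timestamp).
NOW = datetime(2026, 2, 8, tzinfo=timezone.utc)
COMMITS = {
    "c3": ("c2", NOW - timedelta(days=1)),
    "c2": ("c1", NOW - timedelta(days=10)),
    "c1": (None, NOW - timedelta(days=40)),
}
BRANCHES = {"main": ("c3", 7)}  # branch -> (head commit, retention days)

def find_expired(commits, branches, now):
    """Walk each branch from its head; commits older than the
    branch's retention cutoff are candidates for expiry."""
    expired, alive = set(), set()
    for head, retention_days in branches.values():
        cutoff = now - timedelta(days=retention_days)
        commit_id = head
        while commit_id is not None:
            parent, ts = commits[commit_id]
            (alive if ts >= cutoff else expired).add(commit_id)
            commit_id = parent
    # A commit considered alive on any branch is never expired.
    return expired - alive, alive

expired, alive = find_expired(COMMITS, BRANCHES, NOW)
print(sorted(expired))  # ['c1', 'c2'] -- outside the 7-day window
```

Note the final set difference: a commit within the retention window of *any* branch stays alive, even if it is expired relative to another branch's window.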
The output of this phase is a self-contained dataset that the downstream Spark GC job can consume independently. This includes:
- A commits CSV listing which commits are alive and which are expired
- A Parquet addresses file listing the physical storage paths of objects eligible for deletion
- A run ID that uniquely identifies this preparation run, used to correlate with the Spark job
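Producing such an output set might look like the sketch below. The two-column CSV layout and the run-ID scheme are illustrative assumptions; the real files follow lakeFS's own schema.

```python
import csv
import io
import uuid

def write_commits_csv(alive, expired):
    """Emit a commits CSV marking each commit alive or expired,
    plus a run ID that correlates this preparation run with the
    downstream Spark job (illustrative layout only)."""
    run_id = uuid.uuid4().hex  # assumed run-ID scheme, for illustration
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["commit_id", "expired"])
    for commit in sorted(alive):
        writer.writerow([commit, "false"])
    for commit in sorted(expired):
        writer.writerow([commit, "true"])
    return run_id, buf.getvalue()

run_id, csv_text = write_commits_csv({"c3"}, {"c1", "c2"})
print(csv_text)
```

In the real pipeline the files land in a well-known location on the repository's backing object storage, keyed by the run ID, so the Spark job can locate them without any further coordination.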
This phase is deliberately read-only with respect to the object store's data objects. It reads metadata (commits, references, manifests) but does not delete anything. Deletion is deferred to the Spark job.
Usage
Trigger metadata preparation when:
- Running the GC pipeline on a schedule (the preparation step always precedes the Spark deletion job)
- Performing a dry run to understand what would be deleted without actually deleting anything
- Debugging GC behavior by inspecting the generated CSV/Parquet files
- Testing retention rule changes before committing to a full GC run
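For the dry-run and debugging cases, the generated commits file can be inspected before anything is deleted. A sketch with only the standard library, assuming an illustrative `commit_id`/`expired` CSV layout rather than lakeFS's actual schema:

```python
import csv
import io

# Illustrative commits CSV as the preparation phase might emit it.
SAMPLE = "commit_id,expired\nc3,false\nc2,true\nc1,true\n"

def summarize(csv_text):
    """Count alive vs. expired commits -- a quick sanity check
    before handing the run over to the Spark deletion job."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    expired = [r["commit_id"] for r in rows if r["expired"] == "true"]
    return {"total": len(rows), "expired": expired}

print(summarize(SAMPLE))  # {'total': 3, 'expired': ['c2', 'c1']}
```

The same inspection scales up with pandas or DuckDB against the real files; the point is that the preparation output is plain data, so a retention-rule change can be validated by diffing two preparation runs.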
Theoretical Basis
The separation of identification from execution in the GC pipeline follows the classic two-phase pattern used in distributed garbage collection systems:
Phase 1 (Mark): Identify live and dead objects by traversing the reference graph. In lakeFS, this is the metadata preparation step, which walks the commit DAG and determines liveness based on retention rules.
Phase 2 (Sweep): Delete the dead objects. In lakeFS, this is delegated to the Spark job, which performs bulk deletion against the object store.
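The mark phase's liveness rule, that an address is deletable only if no live commit references it, reduces to a set difference over the reference graph. A toy sketch with invented commit-to-address mappings:

```python
# Toy mapping: commit -> physical addresses its manifest references
# (illustrative data, not lakeFS's metadata format).
REFS = {
    "c1": {"data/a", "data/b"},
    "c2": {"data/b", "data/c"},
    "c3": {"data/c", "data/d"},
}

def mark(refs, expired):
    """Mark: addresses referenced exclusively by expired commits."""
    live = set().union(*(refs[c] for c in refs if c not in expired))
    dead = set().union(*(refs[c] for c in expired))
    return dead - live

# c1 and c2 are expired, but data/c is still referenced by live c3,
# so only data/a and data/b are eligible for the sweep phase.
print(sorted(mark(REFS, {"c1", "c2"})))  # ['data/a', 'data/b']
```

The sweep phase then only needs the resulting address list, which is why it can run in a separate system (Spark) with no access to the commit graph at all.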
This separation provides several advantages:
| Advantage | Explanation |
|---|---|
| Scalability | The mark phase runs inside the lakeFS server (which has access to the commit graph), while the sweep phase runs in Spark (which can parallelize deletion across hundreds of executors) |
| Safety | The mark phase output can be inspected before the sweep runs, preventing accidental data loss |
| Idempotency | If the Spark job fails partway through, it can be re-run with the same run ID without re-preparing metadata |
| Auditability | The CSV/Parquet files serve as a permanent record of what was marked for deletion and why |
The use of CSV and Parquet as interchange formats is deliberate: these formats are natively supported by Apache Spark and can be easily inspected with standard data tools (e.g., DuckDB, pandas, AWS Athena).