Workflow:Treeverse LakeFS Garbage Collection

Knowledge Sources	lakeFS, lakeFS Documentation, Garbage Collection
Domains	Data_Engineering, Storage_Management, Data_Lake_Management
Last Updated	2026-02-08 10:00 GMT

Overview

End-to-end process for reclaiming storage space by removing unreferenced data objects from the underlying object storage based on configurable retention policies.

Description

This workflow describes how to manage the storage lifecycle in lakeFS by configuring and running garbage collection (GC). As branches are deleted and data is overwritten or removed, the underlying objects in storage may no longer be referenced by any live commit. Garbage collection identifies these orphaned objects and removes them from the physical storage. Retention rules are configured per branch to control how long historical data is preserved. The GC process runs as a Spark job that analyzes the repository metadata and deletes only objects that are both unreferenced and past their retention period.

Usage

Execute this workflow when storage costs are growing due to accumulated historical data, deleted branches, or overwritten objects. Common triggers include: periodic storage cost optimization, compliance requirements mandating data deletion after retention periods, repository maintenance after significant branch cleanup, or storage quota management in shared data lake environments.

Execution Steps

Step 1: Configure Retention Rules

Define garbage collection retention rules for the repository. Rules are set per branch and specify the number of days that unreferenced data should be retained before becoming eligible for deletion. The default branch typically has the longest retention period, while feature or temporary branches may have shorter retention.

Key considerations:

Rules are specified per branch with a retention period in days
A default rule can apply to branches without explicit rules
Shorter retention saves storage but reduces the window for rollback
Rules should account for downstream consumers that may reference older commits
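
As a rough illustration of how per-branch rules with a default fallback resolve, the following Python sketch builds a rules document and looks up the retention period for a branch. The field names (default_retention_days, branches, branch_id, retention_days) are assumed to match the JSON layout lakeFS expects for GC rules and should be checked against the documentation for your version; the branch names, day counts, and both helper functions are hypothetical.

```python
import json

# Hypothetical rules; field names are an assumption about the lakeFS GC rules JSON.
gc_rules = {
    "default_retention_days": 14,
    "branches": [
        {"branch_id": "main", "retention_days": 28},
        {"branch_id": "tmp-experiments", "retention_days": 3},
    ],
}


def retention_days_for(rules: dict, branch_id: str) -> int:
    """Return the branch's explicit retention, else the default rule."""
    for rule in rules.get("branches", []):
        if rule["branch_id"] == branch_id:
            return rule["retention_days"]
    return rules["default_retention_days"]


def validate(rules: dict) -> None:
    """Reject negative or non-integer retention periods before applying rules."""
    periods = [rules["default_retention_days"]]
    periods += [r["retention_days"] for r in rules.get("branches", [])]
    for days in periods:
        if not isinstance(days, int) or days < 0:
            raise ValueError(f"invalid retention period: {days!r}")


validate(gc_rules)
# Serialized form that could be saved to a rules file and applied to the repository.
rules_json = json.dumps(gc_rules, indent=2)
```

A branch without an explicit entry, such as a short-lived feature branch, falls back to the default period in this sketch.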

Step 2: Prepare GC Metadata

Initiate the GC preparation phase, which analyzes the repository metadata to identify all committed and uncommitted object references. This step scans all branches, commits, and tags to build a complete picture of which objects are still referenced and which are candidates for deletion.

Key considerations:

Preparation can be triggered via the lakeFS API
The process produces metadata files that the GC job consumes
Both committed objects and uncommitted (staged) objects are considered
This step is read-only and does not delete any data
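
The idea behind the preparation phase can be sketched with a toy in-memory model. This is not the lakeFS implementation: the Commit class, the referenced_addresses function, and the rule that a commit stays live while it is a branch head or younger than the branch's retention period are simplifying assumptions, and tags are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    commit_id: str
    age_days: int        # age of the commit in days (toy value)
    objects: frozenset   # physical addresses the commit references


def referenced_addresses(branches, staged, retention, default_days):
    """Collect addresses that must be kept: live commits plus staged objects."""
    keep = set(staged)  # uncommitted (staged) objects are always kept
    for branch, commits in branches.items():
        days = retention.get(branch, default_days)
        for index, commit in enumerate(commits):
            is_head = index == 0  # commits are listed newest first
            if is_head or commit.age_days <= days:
                keep |= commit.objects
    return keep


branches = {
    "main": [
        Commit("c3", 0, frozenset({"a", "b"})),
        Commit("c2", 10, frozenset({"a", "c"})),
        Commit("c1", 40, frozenset({"d"})),  # older than main's 28 days
    ],
    "dev": [Commit("d1", 20, frozenset({"e"}))],  # head, kept despite its age
}

keep = referenced_addresses(branches, staged={"s1"},
                            retention={"main": 28}, default_days=7)
# Only reads the model and builds a set; nothing is deleted here.
```

In this model only address "d" is left out of the keep set, because the single commit that references it is neither a branch head nor within the retention window.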

Step 3: Execute GC Job

Run the garbage collection Spark job against the prepared metadata. The job compares the set of all objects in storage against the set of referenced objects, applies retention rules, and identifies objects that are safe to delete. The job then removes the unreferenced objects from the underlying storage.

Key considerations:

The GC job runs as a Spark application for scalability
Only objects that are unreferenced AND past their retention period are deleted
The job needs credentials with write access to the underlying storage, including permission to delete objects
Running GC during low-traffic periods is recommended
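
The deletion rule (unreferenced AND past retention) can be illustrated with a small Python sketch. It is a toy model rather than the Spark job's actual logic: the sweep function, the mapping from address to days unreferenced, and the injected delete callable are all hypothetical, and a dry-run default is used here so that nothing is removed unless explicitly requested.

```python
from typing import Callable, Dict, List, Optional


def sweep(unreferenced_age: Dict[str, Optional[int]],
          retention_days: int,
          delete: Callable[[str], None],
          dry_run: bool = True) -> List[str]:
    """Return addresses eligible for deletion; delete them unless dry_run.

    unreferenced_age maps an address to the number of days it has been
    unreferenced, or None while it is still referenced.
    """
    doomed = sorted(
        addr for addr, age in unreferenced_age.items()
        if age is not None and age > retention_days
    )
    if not dry_run:
        for addr in doomed:
            delete(addr)
    return doomed


store = {"data/a": None, "data/d": 40, "data/x": 5}
fake_bucket = set(store)  # stands in for the object store

planned = sweep(store, retention_days=28, delete=fake_bucket.discard)
deleted = sweep(store, retention_days=28, delete=fake_bucket.discard,
                dry_run=False)
```

The first call only reports what would be removed; the second performs the deletion against the stand-in bucket. The object unreferenced for 5 days survives because it is still inside the retention window.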

Step 4: Verify GC Results

After the GC job completes, verify that the expected objects were removed and that referenced data remains intact. Check storage metrics to confirm space reclamation, and validate that active branches, commits, and tags still resolve correctly.

Key considerations:

Verify that no referenced objects were incorrectly deleted
Storage savings should roughly match the total size of unreferenced data that has passed its retention period
All active branches and tags should remain fully functional
GC run logs provide details on objects deleted and space reclaimed
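
One way to check these points is to compare storage listings taken before and after the run against the set of addresses that must still exist. The sketch below assumes such listings are available as dictionaries from address to size in bytes; the verify_gc function and the sample data are hypothetical and not part of lakeFS.

```python
def verify_gc(referenced, before, after):
    """Compare listings (address -> size in bytes) taken before and after GC."""
    missing = sorted(set(referenced) - set(after))  # should be empty
    removed = set(before) - set(after)
    reclaimed = sum(before[addr] for addr in removed)
    return {
        "missing_referenced": missing,
        "removed": sorted(removed),
        "reclaimed_bytes": reclaimed,
    }


before = {"a": 100, "b": 200, "d": 4000}
after = {"a": 100, "b": 200}
report = verify_gc(referenced={"a", "b"}, before=before, after=after)

if report["missing_referenced"]:
    raise RuntimeError(f"referenced objects missing: {report['missing_referenced']}")
```

A non-empty missing_referenced list would indicate that live data was deleted and should be treated as an incident rather than a routine finding.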

Step 5: Schedule Periodic GC

Set up a recurring schedule for garbage collection to run automatically. Periodic GC prevents unbounded storage growth and ensures that retention policies are consistently enforced. The frequency should balance storage cost savings against GC job overhead.

Key considerations:

Running GC weekly or every two weeks is common for active repositories
Schedule GC during maintenance windows or low-activity periods
Monitor GC job duration and storage reclamation trends over time
Adjust retention rules as data patterns and compliance requirements evolve
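
A minimal sketch of computing upcoming run times for such a schedule is shown below. The upcoming_gc_runs function, the Sunday 02:00 window, and the two-week interval are illustrative assumptions; in practice the schedule would normally live in whatever scheduler or orchestrator already launches the Spark job.

```python
from datetime import datetime, timedelta


def upcoming_gc_runs(now: datetime, weekday: int = 6, hour: int = 2,
                     interval_weeks: int = 1, count: int = 3):
    """Return the next `count` run times, strictly after `now`.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    first = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    first += timedelta(days=(weekday - now.weekday()) % 7)
    if first <= now:
        first += timedelta(weeks=1)  # this week's window has already passed
    return [first + timedelta(weeks=interval_weeks * i) for i in range(count)]


# Every two weeks, Sundays at 02:00, starting from a fixed reference time.
runs = upcoming_gc_runs(datetime(2026, 2, 8, 10, 0), interval_weeks=2)
```

Recording the duration and reclaimed bytes of each run alongside these timestamps makes it easier to see when the interval or the retention rules need adjusting.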
