Workflow:Treeverse LakeFS Garbage Collection

From Leeroopedia
Revision as of 11:01, 16 February 2026 by Admin (auto-imported from workflows/Treeverse_LakeFS_Garbage_Collection.md)


Knowledge Sources
Domains: Data_Engineering, Storage_Management, Data_Lake_Management
Last Updated: 2026-02-08 10:00 GMT

Overview

End-to-end process for reclaiming storage space by removing unreferenced data objects from the underlying object storage based on configurable retention policies.

Description

This workflow describes how to manage storage lifecycle in lakeFS by configuring and running garbage collection (GC). As branches are deleted and data is overwritten or removed, the underlying objects in storage may become unreferenced by any live commit. Garbage collection identifies these orphaned objects and removes them from the physical storage. Retention rules are configured per branch to control how long historical data is preserved. The GC process runs as a Spark job that analyzes the repository metadata and safely deletes expired objects.

Usage

Execute this workflow when storage costs are growing due to accumulated historical data, deleted branches, or overwritten objects. Common triggers include: periodic storage cost optimization, compliance requirements mandating data deletion after retention periods, repository maintenance after significant branch cleanup, or storage quota management in shared data lake environments.

Execution Steps

Step 1: Configure Retention Rules

Define garbage collection retention rules for the repository. Rules are set per branch and specify the number of days that unreferenced data should be retained before becoming eligible for deletion. The default branch typically has the longest retention period, while feature or temporary branches may have shorter retention.

Key considerations:

  • Rules are specified per branch with a retention period in days
  • A default rule can apply to branches without explicit rules
  • Shorter retention saves storage but reduces the window for rollback
  • Rules should account for downstream consumers that may reference older commits
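
As an illustration of the rule structure described above, the following sketch builds a retention-rules document with one default and per-branch overrides, then resolves the effective retention for a branch. The field names (`default_retention_days`, `branch_id`, `retention_days`) are modeled on the lakeFS GC rules JSON and should be checked against the lakeFS version in use. The helper names `build_gc_rules` and `retention_for` and the branch names are hypothetical.

```python
import json


def build_gc_rules(default_days, branch_days):
    """Build a retention-rules document: one default plus per-branch overrides.

    Field names are an assumption modeled on the lakeFS GC rules JSON.
    """
    if default_days < 1:
        raise ValueError("default retention must be at least 1 day")
    branches = []
    for branch, days in sorted(branch_days.items()):
        if days < 1:
            raise ValueError(f"retention for {branch!r} must be at least 1 day")
        branches.append({"branch_id": branch, "retention_days": days})
    return {"default_retention_days": default_days, "branches": branches}


def retention_for(rules, branch):
    """Return the effective retention for a branch, falling back to the default."""
    for entry in rules["branches"]:
        if entry["branch_id"] == branch:
            return entry["retention_days"]
    return rules["default_retention_days"]


# Hypothetical example: long retention on main, short on a scratch branch.
rules = build_gc_rules(21, {"main": 28, "tmp-experiments": 3})
print(json.dumps(rules, indent=2))
```

Branches without an explicit entry, such as a newly created feature branch, resolve to the default value in this sketch.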

Step 2: Prepare GC Metadata

Initiate the GC preparation phase, which analyzes the repository metadata to identify all committed and uncommitted object references. This step scans all branches, commits, and tags to build a complete picture of which objects are still referenced and which are candidates for deletion.

Key considerations:

  • Preparation can be triggered via the lakeFS API
  • The process produces metadata files that the GC job consumes
  • Both committed objects and uncommitted (staged) objects are considered
  • This step is read-only and does not delete any data
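
The reference scan in this step can be pictured with a toy in-memory model. This is a conceptual sketch, not the lakeFS implementation. The commit graph layout, the `collect_referenced` helper, and the object addresses are all hypothetical. Retention is deliberately ignored here because the workflow applies it in the next step.

```python
def collect_referenced(commits, heads, staged):
    """Collect every object address reachable from branch and tag heads.

    commits: dict of commit id -> {"parents": [ids], "objects": set of addresses}
    heads:   commit ids that branches and tags point to
    staged:  dict of branch name -> set of uncommitted (staged) addresses
    """
    referenced = set()
    seen = set()
    stack = list(heads)
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue  # already visited via another branch or merge parent
        seen.add(commit_id)
        referenced |= commits[commit_id]["objects"]
        stack.extend(commits[commit_id]["parents"])
    for addresses in staged.values():
        referenced |= addresses  # staged objects count as referenced too
    return referenced


# Hypothetical repository: c3 belonged to a deleted branch, so no head reaches it.
commits = {
    "c1": {"parents": [], "objects": {"data/a", "data/b"}},
    "c2": {"parents": ["c1"], "objects": {"data/b", "data/c"}},
    "c3": {"parents": [], "objects": {"data/z"}},
}
referenced = collect_referenced(commits, heads=["c2"], staged={"main": {"data/s"}})
print(sorted(referenced))
```

The function only reads its inputs, mirroring the read-only nature of the preparation phase.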

Step 3: Execute GC Job

Run the garbage collection Spark job against the prepared metadata. The job compares the set of all objects in storage against the set of referenced objects, applies retention rules, and identifies objects that are safe to delete. The job then removes the unreferenced objects from the underlying storage.

Key considerations:

  • The GC job runs as a Spark application for scalability
  • Only objects that are unreferenced AND past their retention period are deleted
  • The job must have write (delete) permission on the underlying storage, since it removes the objects itself
  • Running GC during low-traffic periods is recommended
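
The deletion criterion (unreferenced and past retention) can be sketched as a simple filter. This is a simplified model under stated assumptions: a single retention value stands in for the per-branch rules, an object's age is measured from a last-modified timestamp, and `select_deletable` is a hypothetical helper rather than part of the lakeFS Spark job.

```python
from datetime import datetime, timedelta, timezone


def select_deletable(stored, referenced, retention_days, now):
    """Return addresses that are unreferenced AND older than the retention period.

    stored:     dict of address -> last-modified datetime
    referenced: set of addresses still reachable from live commits or staging
    """
    cutoff = now - timedelta(days=retention_days)
    return sorted(
        address
        for address, last_modified in stored.items()
        if address not in referenced and last_modified < cutoff
    )


now = datetime(2026, 2, 16, tzinfo=timezone.utc)
stored = {
    "data/a": now - timedelta(days=40),  # old, but still referenced
    "data/b": now - timedelta(days=40),  # old and unreferenced
    "data/c": now - timedelta(days=2),   # unreferenced, but within retention
}
deletable = select_deletable(stored, {"data/a"}, retention_days=21, now=now)
print(deletable)
```

Only `data/b` satisfies both conditions in this example, so it is the only address selected.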

Step 4: Verify GC Results

After the GC job completes, verify that the expected objects were removed and that referenced data remains intact. Check storage metrics to confirm space reclamation, and validate that active branches, commits, and tags still resolve correctly.

Key considerations:

  • Verify that no referenced objects were incorrectly deleted
  • Reclaimed space should roughly equal the total size of the objects that were both unreferenced and past their retention period
  • All active branches and tags should remain fully functional
  • GC run logs provide details on objects deleted and space reclaimed
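
The checks above can be sketched as a comparison of storage listings taken before and after the run. The inputs are assumed to be plain address-to-size mappings obtained by whatever listing mechanism is available. `verify_gc` and the report keys are hypothetical names.

```python
def verify_gc(before, after, referenced):
    """Compare storage listings around a GC run.

    before, after: dict of address -> size in bytes
    referenced:    set of addresses that must survive the run
    """
    deleted = sorted(address for address in before if address not in after)
    missing = sorted(address for address in referenced if address not in after)
    return {
        "deleted": deleted,
        "reclaimed_bytes": sum(before[address] for address in deleted),
        "missing_referenced": missing,  # must be empty for a healthy run
    }


before = {"data/a": 100, "data/b": 250, "data/c": 50}
after = {"data/a": 100, "data/c": 50}
report = verify_gc(before, after, referenced={"data/a", "data/c"})
print(report)
```

A non-empty `missing_referenced` list would indicate that referenced data was removed and the run needs investigation.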

Step 5: Schedule Periodic GC

Set up a recurring schedule for garbage collection to run automatically. Periodic GC prevents unbounded storage growth and ensures that retention policies are consistently enforced. The frequency should balance storage cost savings against GC job overhead.

Key considerations:

  • Weekly or bi-weekly GC runs are common for active repositories
  • Schedule GC during maintenance windows or low-activity periods
  • Monitor GC job duration and storage reclamation trends over time
  • Adjust retention rules as data patterns and compliance requirements evolve
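
As a small illustration of scheduling inside a maintenance window, the following sketch computes the next run time from the previous one. The weekly interval and the 02:00 window start are example values, and `next_gc_run` is a hypothetical helper. In practice a scheduler such as cron or a workflow orchestrator would own this logic.

```python
from datetime import datetime, timedelta


def next_gc_run(last_run, interval_days=7, window_start_hour=2):
    """Return the next run: interval_days after last_run, snapped to the window start."""
    target = last_run + timedelta(days=interval_days)
    return target.replace(hour=window_start_hour, minute=0, second=0, microsecond=0)


last_run = datetime(2026, 2, 1, 3, 30)
print(next_gc_run(last_run))                    # weekly cadence
print(next_gc_run(last_run, interval_days=14))  # bi-weekly cadence
```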

GitHub URL

Workflow Repository