Principle:Treeverse LakeFS Periodic GC Scheduling
| Knowledge Sources | |
|---|---|
| Domains | Storage_Management, Data_Lifecycle |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Periodic GC scheduling automates the garbage collection pipeline on a recurring basis, continuously reclaiming storage from expired data versions without manual intervention.
Description
Garbage collection is not a one-time operation. As a lakeFS repository accumulates commits, objects continuously become eligible for collection as they age past their retention window. Without periodic execution, storage costs grow without bound even though the data is no longer needed.
Periodic GC scheduling addresses this by automating the full GC pipeline as a recurring job:
- Configure retention rules (if rules need to be updated; typically a one-time or infrequent step)
- Prepare GC metadata by calling the `prepareGarbageCollectionCommits` API
- Execute the Spark GC job using the returned `run_id`
- Verify results by checking that expired objects were deleted and retained objects remain
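The steps above can be sketched as a single function that an external scheduler invokes once per cycle. This is a minimal sketch, not lakeFS code: `run_gc_pipeline`, `GCRunResult`, and the four callables are hypothetical names, and each callable is assumed to wrap the real operation (for example, `prepare` would wrap the `prepareGarbageCollectionCommits` call and return its `run_id`).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GCRunResult:
    run_id: str
    verified: bool


def run_gc_pipeline(
    prepare: Callable[[], str],
    run_spark_job: Callable[[str], None],
    verify: Callable[[str], bool],
    configure_rules: Optional[Callable[[], None]] = None,
) -> GCRunResult:
    """Run one GC cycle. Every argument is a hypothetical caller-supplied step."""
    if configure_rules is not None:
        configure_rules()      # step 1: only needed when retention rules changed
    run_id = prepare()         # step 2: assumed to wrap prepareGarbageCollectionCommits
    if not run_id:
        raise RuntimeError("preparation step returned no run_id")
    run_spark_job(run_id)      # step 3: Spark GC job keyed by the returned run_id
    verified = verify(run_id)  # step 4: check deleted and retained objects
    return GCRunResult(run_id=run_id, verified=verified)
```

Passing the steps in as callables keeps the ordering logic independent of any particular client library or Spark submission method, so the same function can be exercised with stand-ins before it is wired to real infrastructure.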
This pipeline must be orchestrated by an external scheduler because lakeFS itself does not include a built-in job scheduler. Common orchestration tools include:
- cron — Simple, widely available, suitable for single-machine deployments
- Apache Airflow — Full-featured workflow orchestration with dependency management, retries, and monitoring
- AWS Step Functions — Serverless orchestration for AWS-native deployments
- Kubernetes CronJob — Container-native scheduling for Kubernetes environments
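As an illustration of how a cadence translates into scheduler configuration, the sketch below maps three cadences to standard five-field cron expressions. The 02:00 start time, the `GC_SCHEDULES` and `crontab_line` names, and the script path in the usage note are all illustrative assumptions.

```python
# Standard five-field cron: minute hour day-of-month month day-of-week.
# The 02:00 start time is an arbitrary illustrative choice.
GC_SCHEDULES = {
    "daily": "0 2 * * *",    # every day at 02:00
    "weekly": "0 2 * * 0",   # every Sunday at 02:00
    "monthly": "0 2 1 * *",  # first day of each month at 02:00
}


def crontab_line(cadence: str, command: str) -> str:
    """Build a crontab entry that runs `command` at the chosen cadence."""
    return f"{GC_SCHEDULES[cadence]} {command}"
```

For example, `crontab_line("weekly", "/opt/gc/run_gc.sh")` yields `0 2 * * 0 /opt/gc/run_gc.sh`, where the script path is hypothetical. Kubernetes CronJob and Apache Airflow accept the same five-field syntax for their schedules; other schedulers may use a different expression format.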
Usage
Set up periodic GC scheduling when:
- A repository is in active use and accumulating historical data versions
- Storage costs need to be managed on an ongoing basis
- Compliance requirements mandate timely deletion of expired data
- The GC pipeline has been validated through manual runs and is ready for automation
Theoretical Basis
The scheduling of garbage collection involves a cost optimization trade-off between two competing concerns:
Storage cost: Increases continuously as new data versions accumulate. More frequent GC reduces storage cost by reclaiming space sooner.
Compute cost: Each GC run incurs compute costs (Spark cluster time, API calls, network I/O). More frequent GC increases compute cost.
The optimal frequency minimizes the total cost (storage + compute) over time:
| GC Frequency | Storage Cost Impact | Compute Cost Impact | Typical Use Case |
|---|---|---|---|
| Daily | Lowest (expired data reclaimed within 24 hours) | Highest (365 runs/year) | High-churn repositories with expensive storage (e.g., large Parquet datasets on S3) |
| Weekly | Moderate (up to 7 days of expired data accumulation) | Moderate (52 runs/year) | Most production repositories (good balance of cost and freshness) |
| Monthly | Higher (up to 30 days of expired data accumulation) | Lowest (12 runs/year) | Low-churn repositories or repositories with inexpensive storage |
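The trade-off in the table can be made concrete with a toy cost model. The sketch below rests on assumptions that are not part of lakeFS: expired data accumulates at a constant rate, so the uncollected amount averages half of what builds up between runs, and every GC run costs the same fixed amount. All parameter names and figures are hypothetical.

```python
import math


def annual_gc_cost(interval_days, expired_gb_per_day, price_per_gb_day, cost_per_run):
    """Yearly cost of running GC every `interval_days` under a linear toy model."""
    # Uncollected data grows linearly between runs, so it averages half the peak.
    avg_uncollected_gb = expired_gb_per_day * interval_days / 2
    storage = avg_uncollected_gb * price_per_gb_day * 365
    compute = (365 / interval_days) * cost_per_run
    return storage + compute


def optimal_interval_days(expired_gb_per_day, price_per_gb_day, cost_per_run):
    """Interval that minimizes annual_gc_cost (derivative set to zero)."""
    return math.sqrt(2 * cost_per_run / (expired_gb_per_day * price_per_gb_day))
```

Under this model, a repository that expires 100 GB per day at 0.001 per GB-day with a run cost of 5 has an optimum of one run every 10 days, which sits between the weekly and monthly rows. Real workloads rarely accumulate linearly, so the result is a starting point for tuning rather than a prescription.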
The scheduling architecture also embodies the pipeline pattern from workflow orchestration theory. Each step in the GC pipeline has:
- A clear precondition (e.g., the Spark job requires a valid `run_id` from the preparation step)
- A clear postcondition (e.g., preparation produces metadata files at known locations)
- Idempotency where possible (e.g., the Spark job can be retried with the same `run_id`)
These properties make the pipeline suitable for orchestration tools that support automatic retries and dependency management.
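For schedulers without built-in retries (plain cron, for instance), the retry behavior can be approximated in the job wrapper itself. The sketch below is a hypothetical helper, not part of lakeFS: it reruns a job with the same `run_id` on failure, relying on the idempotency described above, and backs off exponentially between attempts. The `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable


def run_with_retries(
    job: Callable[[str], None],
    run_id: str,
    max_attempts: int = 3,
    base_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run job(run_id), retrying on failure; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            job(run_id)  # same run_id every time, relying on idempotency
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the scheduler
            sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff
    raise ValueError("max_attempts must be at least 1")
```

Re-raising the final exception lets the surrounding scheduler mark the run as failed and trigger its own alerting.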
The end-to-end automation of the pipeline (from rule configuration through verification) follows the closed-loop control principle: the system not only performs the action (deletion) but also verifies the outcome, enabling alerting on failures and building confidence in the ongoing operation.
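The verification half of that loop can be sketched as a check over two sets of object addresses. This is an illustrative sketch under stated assumptions: `verify_gc` and `VerificationReport` are hypothetical names, the caller is assumed to already know which addresses were expected to expire and which must be retained, and `exists` stands in for whatever object-store lookup the deployment uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class VerificationReport:
    not_deleted: List[str] = field(default_factory=list)      # expired but still present
    wrongly_deleted: List[str] = field(default_factory=list)  # retained but missing

    @property
    def ok(self) -> bool:
        return not self.not_deleted and not self.wrongly_deleted


def verify_gc(
    expired: Iterable[str],
    retained: Iterable[str],
    exists: Callable[[str], bool],
) -> VerificationReport:
    """Compare expected GC outcomes against the object store via exists()."""
    return VerificationReport(
        not_deleted=[addr for addr in expired if exists(addr)],
        wrongly_deleted=[addr for addr in retained if not exists(addr)],
    )
```

A report whose `ok` property is false is the signal to alert: entries in `not_deleted` indicate storage that was not reclaimed, while entries in `wrongly_deleted` indicate data loss and warrant immediate attention.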