Principle:Treeverse LakeFS Periodic GC Scheduling

Knowledge Sources	lakeFS, lakeFS Documentation
Domains	Storage_Management, Data_Lifecycle
Last Updated	2026-02-08 00:00 GMT

Overview

Periodic GC scheduling automates the garbage collection pipeline on a recurring basis, continuously reclaiming storage from expired data versions without manual intervention.

Description

Garbage collection is not a one-time operation. As a lakeFS repository accumulates commits, objects continuously become eligible for collection once they age past their retention window. Without periodic execution, storage costs grow without bound even though the expired data is no longer needed.

Periodic GC scheduling addresses this by automating the full GC pipeline as a recurring job:

1. Configure retention rules (if rules need to be updated; typically a one-time or infrequent step)
2. Prepare GC metadata by calling the prepareGarbageCollectionCommits API
3. Execute the Spark GC job using the returned run_id
4. Verify results by checking that expired objects were deleted and retained objects remain
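
The recurring part of this pipeline (prepare, execute, verify) can be sketched as a small driver function. This is a minimal illustration, not lakeFS code: `run_gc_pipeline`, `GCRunResult`, and the three injected callables are hypothetical names. In a real deployment, `prepare` would wrap the prepareGarbageCollectionCommits call and `execute` would submit the Spark job; both are left as parameters here because their exact invocation depends on the environment.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GCRunResult:
    run_id: str      # identifier returned by the preparation step
    verified: bool   # outcome of the verification step


def run_gc_pipeline(
    prepare: Callable[[], str],
    execute: Callable[[str], None],
    verify: Callable[[str], bool],
) -> GCRunResult:
    """Run one GC cycle: prepare metadata, execute the job, verify results."""
    run_id = prepare()        # preparation yields the run_id for this cycle
    execute(run_id)           # the Spark job consumes that same run_id
    verified = verify(run_id) # check deletions and retentions afterwards
    return GCRunResult(run_id=run_id, verified=verified)
```

Passing the steps in as callables keeps the ordering logic testable without a lakeFS server or a Spark cluster.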

This pipeline must be orchestrated by an external scheduler because lakeFS itself does not include a built-in job scheduler. Common orchestration tools include:

cron — Simple, widely available, suitable for single-machine deployments
Apache Airflow — Full-featured workflow orchestration with dependency management, retries, and monitoring
AWS Step Functions — Serverless orchestration for AWS-native deployments
Kubernetes CronJob — Container-native scheduling for Kubernetes environments
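
Whichever scheduler is chosen, the GC job usually needs to report success or failure in a form the scheduler understands. One common convention is the process exit status: a Kubernetes Job or an Airflow BashOperator, for example, treats a non-zero status as a failed run. The sketch below assumes that convention; `scheduled_entrypoint`, `run_cycle`, and the specific status values 1 and 2 are hypothetical choices for illustration.

```python
import sys
from typing import Callable


def scheduled_entrypoint(run_cycle: Callable[[], bool]) -> int:
    """Return a process exit status for one scheduled GC cycle."""
    try:
        verified = run_cycle()
    except Exception as exc:  # any pipeline error counts as a failed run
        print(f"GC cycle failed: {exc}", file=sys.stderr)
        return 1
    if not verified:
        # The job ran, but the post-run check did not pass.
        print("GC cycle finished but verification failed", file=sys.stderr)
        return 2
    return 0
```

A wrapper script invoked by the scheduler would then end with `sys.exit(scheduled_entrypoint(my_cycle))`, where `my_cycle` is whatever function runs the pipeline once.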

Usage

Set up periodic GC scheduling when:

A repository is in active use and accumulating historical data versions
Storage costs need to be managed on an ongoing basis
Compliance requirements mandate timely deletion of expired data
The GC pipeline has been validated through manual runs and is ready for automation

Theoretical Basis

The scheduling of garbage collection involves a cost optimization trade-off between two competing concerns:

Storage cost: Increases continuously as new data versions accumulate. More frequent GC reduces storage cost by reclaiming space sooner.

Compute cost: Each GC run incurs compute costs (Spark cluster time, API calls, network I/O). More frequent GC increases compute cost.

The optimal frequency minimizes the total cost (storage + compute) over time:

GC Frequency	Storage Cost Impact	Compute Cost Impact	Typical Use Case
Daily	Lowest (expired data reclaimed within 24 hours)	Highest (365 runs/year)	High-churn repositories with expensive storage (e.g., large Parquet datasets on S3)
Weekly	Moderate (up to 7 days of expired data accumulation)	Moderate (52 runs/year)	Most production repositories (good balance of cost and freshness)
Monthly	Higher (up to 30 days of expired data accumulation)	Lowest (12 runs/year)	Low-churn repositories or repositories with inexpensive storage
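
The trade-off in the table can be made concrete with a rough cost model. The sketch below is an assumption-laden illustration, not a lakeFS formula: it assumes expired data accumulates at a constant rate between runs (so on average half an interval's worth is held), and every parameter name and price is hypothetical.

```python
RUNS_PER_YEAR = {"daily": 365, "weekly": 52, "monthly": 12}


def annual_cost(
    frequency: str,
    expired_gb_per_day: float,
    storage_cost_per_gb_month: float,
    compute_cost_per_run: float,
) -> float:
    """Estimate yearly cost (storage of uncollected data + GC compute)."""
    runs = RUNS_PER_YEAR[frequency]
    interval_days = 365 / runs
    # Linear accumulation between runs: the average backlog is half of
    # what builds up over one full interval.
    avg_expired_gb = expired_gb_per_day * interval_days / 2
    storage = avg_expired_gb * storage_cost_per_gb_month * 12
    compute = runs * compute_cost_per_run
    return storage + compute


# Made-up inputs: 100 GB/day expiring, $0.023 per GB-month, $5 per GC run.
for freq in RUNS_PER_YEAR:
    print(freq, round(annual_cost(freq, 100, 0.023, 5.0), 2))
```

With these made-up inputs the weekly schedule comes out cheapest, because daily runs are dominated by compute cost and monthly runs by the stored backlog. Different churn rates or prices shift the optimum.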

The scheduling architecture also embodies the pipeline pattern from workflow orchestration theory. Each step in the GC pipeline has:

A clear precondition (e.g., Spark job requires a valid run_id from the preparation step)
A clear postcondition (e.g., preparation produces metadata files at known locations)
Idempotency where possible (e.g., the Spark job can be retried with the same run_id)

These properties make the pipeline suitable for orchestration tools that support automatic retries and dependency management.
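
Where the orchestrator does not provide retries itself, the idempotency property can be used directly: a failed execution is retried with the same run_id rather than a freshly prepared one. The following is a minimal sketch under that assumption; `execute_with_retries` and its parameters are hypothetical names, and `execute` stands in for whatever submits the Spark job.

```python
import time
from typing import Callable


def execute_with_retries(
    execute: Callable[[str], None],
    run_id: str,
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry the job with the same run_id; return the attempt that succeeded."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            execute(run_id)  # same run_id on every attempt
            return attempt
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(backoff_seconds * attempt)  # linear backoff
    raise RuntimeError(
        f"GC run {run_id} failed after {max_attempts} attempts"
    ) from last_error
```

The `sleep` parameter is injectable only so the backoff can be skipped in tests.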

The end-to-end automation of the pipeline (from rule configuration through verification) follows the closed-loop control principle: the system not only performs the action (deletion) but also verifies the outcome, enabling alerting on failures and building confidence in the ongoing operation.
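
The verification half of that loop can be sketched as a set comparison between what exists in storage after the run and what was expected. This is an illustrative helper only: `verify_gc` and its arguments are hypothetical, and how the key lists are obtained (for example, by listing the object store) is outside its scope.

```python
from typing import Dict, Iterable, List


def verify_gc(
    existing_keys: Iterable[str],
    expected_deleted: Iterable[str],
    expected_retained: Iterable[str],
) -> Dict[str, List[str]]:
    """Compare post-GC storage contents against expectations."""
    existing = set(existing_keys)
    return {
        # Expired objects that are still present in storage.
        "not_deleted": sorted(set(expected_deleted) & existing),
        # Retained objects that are unexpectedly gone.
        "missing": sorted(set(expected_retained) - existing),
    }


report = verify_gc(
    existing_keys=["data/a", "data/c"],
    expected_deleted=["data/b", "data/c"],
    expected_retained=["data/a", "data/d"],
)
# A non-empty list in either field is a reason to alert.
print(report)
```

An empty report means the run behaved as expected; a non-empty `missing` list is the more serious signal, since it indicates data that should have survived was removed.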

Related Pages

Implemented By

Implementation:Treeverse_LakeFS_External_Scheduler_Configuration
