Principle:Treeverse LakeFS Periodic GC Scheduling

From Leeroopedia


Knowledge Sources
Domains: Storage_Management, Data_Lifecycle
Last Updated: 2026-02-08 00:00 GMT

Overview

Periodic GC scheduling automates the garbage collection pipeline on a recurring basis, continuously reclaiming storage from expired data versions without manual intervention.

Description

Garbage collection is not a one-time operation. As a lakeFS repository accumulates commits, objects continuously become eligible for collection as they age past their retention window. Without periodic execution, storage costs grow without bound even though the expired data is no longer needed.

Periodic GC scheduling addresses this by automating the full GC pipeline as a recurring job:

  1. Configure retention rules (if rules need to be updated; typically a one-time or infrequent step)
  2. Prepare GC metadata by calling the prepareGarbageCollectionCommits API
  3. Execute the Spark GC job using the returned run_id
  4. Verify results by checking that expired objects were deleted and retained objects remain
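The four steps above can be sketched as a single function that an external scheduler invokes once per cycle. This is a minimal illustration, not lakeFS code: `configure_rules`, `prepare`, `run_spark_job`, and `verify` are hypothetical caller-supplied hooks standing in for the real API call, Spark submission, and object checks, and `GCRunResult` is a name introduced here only for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GCRunResult:
    """Outcome of one GC cycle (illustrative type, not part of lakeFS)."""
    run_id: str
    verified: bool


def run_gc_pipeline(
    prepare: Callable[[], str],
    run_spark_job: Callable[[str], None],
    verify: Callable[[str], bool],
    configure_rules: Optional[Callable[[], None]] = None,
) -> GCRunResult:
    """Run one GC cycle using hypothetical hooks supplied by the caller."""
    # Step 1 (optional): update retention rules only when a hook is given,
    # since rule changes are typically infrequent.
    if configure_rules is not None:
        configure_rules()

    # Step 2: prepare GC metadata; the hook is assumed to return the run_id.
    run_id = prepare()
    if not run_id:
        raise RuntimeError("preparation step returned no run_id")

    # Step 3: execute the Spark GC job for that run_id.
    run_spark_job(run_id)

    # Step 4: verify that expired objects are gone and retained ones remain.
    return GCRunResult(run_id=run_id, verified=verify(run_id))
```

Passing the steps in as callables keeps the ordering and the run_id hand-off explicit while leaving the choice of lakeFS client and Spark submission mechanism to the deployment.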

This pipeline must be orchestrated by an external scheduler because lakeFS itself does not include a built-in job scheduler. Common orchestration tools include:

  • cron — Simple, widely available, suitable for single-machine deployments
  • Apache Airflow — Full-featured workflow orchestration with dependency management, retries, and monitoring
  • AWS Step Functions — Serverless orchestration for AWS-native deployments
  • Kubernetes CronJob — Container-native scheduling for Kubernetes environments

Usage

Set up periodic GC scheduling when:

  • A repository is in active use and accumulating historical data versions
  • Storage costs need to be managed on an ongoing basis
  • Compliance requirements mandate timely deletion of expired data
  • The GC pipeline has been validated through manual runs and is ready for automation

Theoretical Basis

The scheduling of garbage collection involves a cost optimization trade-off between two competing concerns:

Storage cost: Increases continuously as new data versions accumulate. More frequent GC reduces storage cost by reclaiming space sooner.

Compute cost: Each GC run incurs compute costs (Spark cluster time, API calls, network I/O). More frequent GC increases compute cost.

The optimal frequency minimizes the total cost (storage + compute) over time:

GC Frequency | Storage Cost Impact | Compute Cost Impact | Typical Use Case
Daily | Lowest (expired data reclaimed within 24 hours) | Highest (365 runs/year) | High-churn repositories with expensive storage (e.g., large Parquet datasets on S3)
Weekly | Moderate (up to 7 days of expired data accumulation) | Moderate (52 runs/year) | Most production repositories (good balance of cost and freshness)
Monthly | Higher (up to 30 days of expired data accumulation) | Lowest (12 runs/year) | Low-churn repositories or repositories with inexpensive storage
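To make the trade-off concrete, the following sketch estimates a yearly total for a given cadence. It rests on simplifying assumptions that are not from the source: data expires at a constant rate, each run reclaims everything that has expired, so the average unreclaimed volume is half of what accumulates between runs, and compute cost per run is fixed. The function name and all parameters are illustrative.

```python
def annual_gc_cost(
    runs_per_year: int,
    compute_cost_per_run: float,
    expired_gb_per_day: float,
    storage_cost_per_gb_month: float,
) -> float:
    """Estimate yearly storage + compute cost for a GC cadence (toy model)."""
    if runs_per_year <= 0:
        raise ValueError("runs_per_year must be positive")

    interval_days = 365 / runs_per_year

    # Assumption: expired data grows linearly from 0 to
    # expired_gb_per_day * interval_days between runs, so the
    # time-averaged unreclaimed volume is half of that peak.
    avg_unreclaimed_gb = expired_gb_per_day * interval_days / 2

    storage_cost = avg_unreclaimed_gb * storage_cost_per_gb_month * 12
    compute_cost = runs_per_year * compute_cost_per_run
    return storage_cost + compute_cost


if __name__ == "__main__":
    # Hypothetical prices; substitute measured values for a real repository.
    for label, runs in (("daily", 365), ("weekly", 52), ("monthly", 12)):
        total = annual_gc_cost(runs, 10.0, 100.0, 0.02)
        print(f"{label}: {total:.2f}")
```

Under this model the cheapest cadence depends entirely on the ratio of per-run compute cost to the storage cost of lingering data, which is why the table lists typical use cases rather than a single recommended frequency.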

The scheduling architecture also embodies the pipeline pattern from workflow orchestration theory. Each step in the GC pipeline has:

  • A clear precondition (e.g., Spark job requires a valid run_id from the preparation step)
  • A clear postcondition (e.g., preparation produces metadata files at known locations)
  • Idempotency where possible (e.g., the Spark job can be retried with the same run_id)

These properties make the pipeline suitable for orchestration tools that support automatic retries and dependency management.
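As an illustration of how idempotency enables automatic retries, the sketch below re-invokes a job with the same run_id until it succeeds or the attempt limit is reached. `run_spark_job` is a hypothetical callable standing in for the actual Spark submission, and the linear backoff policy is an assumption chosen for brevity; orchestration tools such as Airflow provide their own retry settings.

```python
import time
from typing import Callable


def run_with_retries(
    run_spark_job: Callable[[str], None],
    run_id: str,
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry a job with the SAME run_id; return the attempt that succeeded."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            # Reusing run_id relies on the job being safe to repeat.
            run_spark_job(run_id)
            return attempt
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: surface the last failure to the scheduler.
                raise
            # Linear backoff before the next attempt (illustrative policy).
            sleep(backoff_seconds * attempt)

    raise AssertionError("unreachable")
```

The preparation step is deliberately outside the retry loop: a retry reuses the metadata already produced for that run_id rather than preparing a new run.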

The end-to-end automation of the pipeline (from rule configuration through verification) follows the closed-loop control principle: the system not only performs the action (deletion) but also verifies the outcome, enabling alerting on failures and building confidence in the ongoing operation.
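A minimal sketch of the verification half of that loop, assuming the caller can list object keys before and after a run. The function and its inputs are hypothetical; how the expired and retained key sets are obtained depends on the deployment and is not specified here.

```python
from typing import Dict, Iterable, List, Union


def verify_gc(
    expired_keys: Iterable[str],
    retained_keys: Iterable[str],
    existing_keys_after_gc: Iterable[str],
) -> Dict[str, Union[bool, List[str]]]:
    """Compare the expected GC outcome with the keys actually present."""
    remaining = set(existing_keys_after_gc)

    # Expired objects that still exist: GC did not delete everything it should.
    leftover_expired = sorted(set(expired_keys) & remaining)

    # Retained objects that are gone: GC deleted something it should not have.
    missing_retained = sorted(set(retained_keys) - remaining)

    return {
        "ok": not leftover_expired and not missing_retained,
        "leftover_expired": leftover_expired,
        "missing_retained": missing_retained,
    }
```

A scheduler task can raise or alert whenever `ok` is false, with the two lists giving the operator a starting point for investigation.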
