Principle:Treeverse LakeFS Retention Rule Configuration

From Leeroopedia


Knowledge Sources
Domains: Storage_Management, Data_Lifecycle
Last Updated: 2026-02-08 00:00 GMT

Overview

Retention rule configuration defines how long data versions are preserved in a lakeFS repository before becoming eligible for garbage collection.

Description

Retention rules are the foundational policy mechanism that governs the storage lifecycle of committed data in lakeFS. A repository's retention rules must specify a default retention period, measured in days from the commit creation date, which applies to every branch that has no override of its own. Per-branch overrides can then be added for branches with different retention requirements.

For example, a production branch (e.g., main) may require 30 or more days of retention to support auditing and rollback scenarios, while short-lived feature branches may only need 7 days of retention. This two-tier model (default + per-branch overrides) balances simplicity with the flexibility needed in real-world data engineering workflows.
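The two-tier lookup can be sketched in Python as follows. The dictionary layout is assumed to mirror the `default_retention_days` / `branches` shape of a lakeFS GC rules document; the helper `retention_days_for` and the branch name `feature/etl` are hypothetical and exist only for illustration.

```python
# Hypothetical rule set: 7-day default, 30-day override for main.
RULES = {
    "default_retention_days": 7,
    "branches": [
        {"branch_id": "main", "retention_days": 30},
    ],
}


def retention_days_for(rules: dict, branch_id: str) -> int:
    """Return the override for branch_id if one exists, else the default."""
    for override in rules.get("branches", []):
        if override["branch_id"] == branch_id:
            return override["retention_days"]
    return rules["default_retention_days"]


print(retention_days_for(RULES, "main"))         # 30
print(retention_days_for(RULES, "feature/etl"))  # 7
```

Only the exceptions need to be listed; any branch created later falls back to the default without further configuration.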

The retention period is always measured in calendar days from the timestamp when a commit was created. Once a commit's age exceeds the applicable retention period for its branch, the objects exclusively referenced by that commit become candidates for garbage collection. Objects that are still referenced by at least one commit within any branch's retention window are never eligible for deletion.
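The eligibility rule in the paragraph above can be illustrated with a small set computation. This is a simplified model of the stated rule, not the lakeFS garbage collection algorithm: the function `gc_candidates`, the tuple layout, and the object keys are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def gc_candidates(commits, retention_days, now):
    """Return objects referenced only by expired commits.

    commits: list of (branch_id, created_at, set_of_object_keys)
    retention_days: dict mapping branch_id -> days (already resolved)
    """
    retained, expired = set(), set()
    for branch_id, created_at, objects in commits:
        age = now - created_at
        if age > timedelta(days=retention_days[branch_id]):
            expired |= objects
        else:
            retained |= objects
    # An object survives if any in-window commit still references it.
    return expired - retained


now = datetime(2026, 2, 8, tzinfo=timezone.utc)
commits = [
    ("main", now - timedelta(days=40), {"a.parquet", "b.parquet"}),
    ("main", now - timedelta(days=5), {"b.parquet", "c.parquet"}),
    ("feature", now - timedelta(days=10), {"d.parquet"}),
]
print(sorted(gc_candidates(commits, {"main": 30, "feature": 7}, now)))
# ['a.parquet', 'd.parquet']
```

Here `b.parquet` is referenced by an expired commit but survives because a 5-day-old commit on `main` still references it.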

Key properties of retention rules:

  • Default retention applies to every branch that does not have an explicit override
  • Per-branch overrides take precedence over the default for their specific branch
  • Retention is measured from the commit creation date, not the object upload date
  • Rules are evaluated during the GC metadata preparation phase, not at write time
  • Changing retention rules takes effect on the next GC preparation run; it does not retroactively affect already-prepared metadata

Usage

Configure retention rules when:

  • Setting up a new repository that will accumulate historical data versions
  • Adjusting storage costs by shortening retention on branches with high churn
  • Extending retention on regulated or production branches to meet compliance requirements
  • Preparing a repository for its first garbage collection run

Theoretical Basis

The retention rule model in lakeFS follows the principle of declarative lifecycle management: operators declare what the desired retention state is, and the system determines how to achieve it. This is analogous to object lifecycle policies in cloud storage systems (e.g., S3 Lifecycle Rules, GCS Object Lifecycle Management), but operates at the version level rather than the object level.

The two-tier structure (default + overrides) draws from the policy inheritance pattern common in access control and configuration management systems. A sensible default covers the majority of cases, while explicit overrides handle exceptions without requiring exhaustive per-branch configuration.

Retention measurement from commit creation date (rather than object creation date) ensures that the retention semantics are deterministic and reproducible: the same commit will always have the same expiration date regardless of when the underlying objects were first uploaded. This simplifies reasoning about GC behavior and makes it possible to predict exactly when data will become eligible for collection.
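As a minimal sketch of this determinism, the expiration time depends only on the commit timestamp and the retention period. The function name `expiration_time` and the sample date are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expiration_time(commit_created_at: datetime, retention_days: int) -> datetime:
    """Moment after which the commit's age exceeds its retention period."""
    return commit_created_at + timedelta(days=retention_days)


created = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = expiration_time(created, 30)
print(expires.isoformat())  # 2026-01-31T12:00:00+00:00
```

Because object upload times never enter the calculation, re-running it for the same commit and rule always yields the same result.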

| Concept | Description |
|---|---|
| Default Retention | A repository-wide retention period (in days) that applies to all branches unless overridden |
| Per-Branch Override | An explicit retention period for a specific branch that takes precedence over the default |
| Commit Age | The number of days elapsed since a commit was created; compared against the retention period |
| Eligible for GC | A commit whose age exceeds its branch's retention period, making its exclusively-referenced objects candidates for deletion |

Related Pages

Implemented By
