Principle:Treeverse LakeFS Retention Rule Configuration
| Knowledge Sources | |
|---|---|
| Domains | Storage_Management, Data_Lifecycle |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Retention rule configuration defines how long data versions are preserved in a lakeFS repository before becoming eligible for garbage collection.
Description
Retention rules are the foundational policy mechanism that governs the storage lifecycle of committed data in lakeFS. Every repository must have a default retention period (measured in days from the commit creation date) that applies to every branch without an explicit override. Per-branch overrides can then be specified to accommodate branches with differing retention requirements.
For example, a production branch (e.g., main) may require 30 or more days of retention to support auditing and rollback scenarios, while short-lived feature branches may only need 7 days of retention. This two-tier model (default + per-branch overrides) balances simplicity with the flexibility needed in real-world data engineering workflows.
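The two-tier model can be sketched as a small rules document plus a lookup function. This is a minimal illustration, not the lakeFS implementation: the field names (`default_retention_days`, `branches`, `branch_id`, `retention_days`) are assumed to follow the JSON shape used for lakeFS GC rules and should be checked against the documentation for the version in use. The helper `effective_retention_days` and the branch name `feature/etl` are hypothetical.

```python
import json

# Hypothetical rules document. Field names are an assumption about the
# lakeFS GC rules JSON shape; verify them against your lakeFS version.
rules = {
    "default_retention_days": 7,
    "branches": [
        {"branch_id": "main", "retention_days": 30},
    ],
}


def effective_retention_days(rules: dict, branch_id: str) -> int:
    """Return the branch override if one exists, otherwise the default."""
    for override in rules.get("branches", []):
        if override["branch_id"] == branch_id:
            return override["retention_days"]
    return rules["default_retention_days"]


# Serialize the rules, e.g. to store them alongside other repository config.
print(json.dumps(rules, indent=2))

print(effective_retention_days(rules, "main"))         # 30 (override)
print(effective_retention_days(rules, "feature/etl"))  # 7 (default)
```

Only `main` needs an explicit entry here; every other branch falls through to the default, which is what keeps the configuration short as the number of branches grows.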
The retention period is always measured in calendar days from the timestamp when a commit was created. Once a commit's age exceeds the applicable retention period for its branch, the objects exclusively referenced by that commit become candidates for garbage collection. Objects that are still referenced by at least one commit within any branch's retention window are never eligible for deletion.
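The eligibility rule described above can be illustrated with a simplified in-memory model. The sketch below assumes each commit belongs to exactly one branch and carries an explicit set of object addresses, which is a deliberate simplification; the function `gc_candidates` and the data layout are hypothetical and do not reflect how lakeFS stores or traverses commits.

```python
from datetime import datetime, timedelta, timezone


def gc_candidates(commits, default_days, overrides, now):
    """Return objects referenced only by commits past their retention period."""
    expired, retained = set(), set()
    for commit in commits:
        # A per-branch override takes precedence over the default.
        days = overrides.get(commit["branch"], default_days)
        age = now - commit["created"]
        if age > timedelta(days=days):
            expired |= commit["objects"]
        else:
            retained |= commit["objects"]
    # An object still referenced by any retained commit is never a candidate.
    return expired - retained


now = datetime(2026, 2, 8, tzinfo=timezone.utc)
commits = [
    {"branch": "main", "created": now - timedelta(days=20), "objects": {"a", "b"}},
    {"branch": "feature", "created": now - timedelta(days=20), "objects": {"b", "c"}},
    {"branch": "feature", "created": now - timedelta(days=3), "objects": {"d"}},
]

print(sorted(gc_candidates(commits, default_days=7, overrides={"main": 30}, now=now)))
# ['c']
```

Object `b` is referenced by an expired `feature` commit, but it survives because the 20-day-old `main` commit is still inside its 30-day window. Only `c`, which no retained commit references, becomes a candidate.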
Key properties of retention rules:
- Default retention applies to every branch that does not have an explicit override
- Per-branch overrides take precedence over the default for their specific branch
- Retention is measured from the commit creation date, not the object upload date
- Rules are evaluated during the GC metadata preparation phase, not at write time
- Changing retention rules takes effect on the next GC preparation run; it does not retroactively affect already-prepared metadata
Usage
Configure retention rules when:
- Setting up a new repository that will accumulate historical data versions
- Adjusting storage costs by shortening retention on branches with high churn
- Extending retention on regulated or production branches to meet compliance requirements
- Preparing a repository for its first garbage collection run
Theoretical Basis
The retention rule model in lakeFS follows the principle of declarative lifecycle management: operators declare what the desired retention state is, and the system determines how to achieve it. This is analogous to object lifecycle policies in cloud storage systems (e.g., S3 Lifecycle Rules, GCS Object Lifecycle Management), but operates at the version level rather than the object level.
The two-tier structure (default + overrides) draws from the policy inheritance pattern common in access control and configuration management systems. A sensible default covers the majority of cases, while explicit overrides handle exceptions without requiring exhaustive per-branch configuration.
Retention measurement from commit creation date (rather than object creation date) ensures that the retention semantics are deterministic and reproducible: the same commit will always have the same expiration date regardless of when the underlying objects were first uploaded. This simplifies reasoning about GC behavior and makes it possible to predict exactly when data will become eligible for collection.
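A short sketch makes the determinism concrete: the expiration moment depends only on the commit creation timestamp and the retention period. The function name `expiration_time` is hypothetical, and the sketch assumes the comparison uses exact timestamps; whether lakeFS rounds to whole days is not stated here.

```python
from datetime import datetime, timedelta, timezone


def expiration_time(commit_created: datetime, retention_days: int) -> datetime:
    """Moment after which a commit's age exceeds its retention period."""
    return commit_created + timedelta(days=retention_days)


created = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)

# Upload times of the underlying objects do not enter the calculation,
# so the same commit always yields the same expiration moment.
print(expiration_time(created, 30).isoformat())
# 2026-01-31T12:00:00+00:00
```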
| Concept | Description |
|---|---|
| Default Retention | A repository-wide retention period (in days) that applies to all branches unless overridden |
| Per-Branch Override | An explicit retention period for a specific branch that takes precedence over the default |
| Commit Age | The number of days elapsed since a commit was created; compared against the retention period |
| Eligible for GC | The state of a commit whose age exceeds its branch's retention period; objects referenced exclusively by such commits become candidates for deletion |