
Principle:Treeverse LakeFS GC Metadata Preparation

From Leeroopedia


Knowledge Sources
Domains Storage_Management, Data_Lifecycle
Last Updated 2026-02-08 00:00 GMT

Overview

GC metadata preparation identifies expired commits and their associated physical addresses, producing structured output files that drive the subsequent garbage collection deletion phase.

Description

Before any objects can be safely deleted from the underlying object storage, the system must determine which objects are still needed and which are eligible for removal. This is the metadata preparation phase of the garbage collection pipeline.

The preparation process works by:

  1. Enumerating all branches in the repository and their configured retention periods
  2. Walking the commit graph for each branch to identify commits whose age exceeds the branch's retention window
  3. Collecting physical addresses referenced exclusively by expired commits (i.e., addresses not referenced by any still-alive commit on any branch)
  4. Writing the results to structured files (CSV for commits, Parquet for addresses) in a well-known location on the repository's backing object storage
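Steps 1 through 3 can be sketched roughly as follows. The data model, function names, and traversal details here are illustrative assumptions, not the actual lakeFS implementation; the key idea is that liveness flows from branch heads backward through the commit graph, bounded by each branch's retention window:

```python
from datetime import datetime, timedelta, timezone

def mark_expired_commits(branches, commits, now=None):
    """Hypothetical mark phase.

    branches: {name: {"head": commit_id, "retention_days": int}}
    commits:  {commit_id: {"parents": [ids], "timestamp": datetime,
                           "addresses": set of physical addresses}}
    Returns (alive_commit_ids, expired_commit_ids).
    """
    now = now or datetime.now(timezone.utc)
    alive = set()
    for branch in branches.values():
        cutoff = now - timedelta(days=branch["retention_days"])
        head = branch["head"]
        # Walk parents from the head; the head itself and any commit
        # newer than the cutoff stay alive. Commit timestamps are
        # assumed monotonic along each parent chain, so the walk can
        # stop at the first expired commit.
        stack = [head]
        while stack:
            cid = stack.pop()
            if cid in alive:
                continue
            commit = commits[cid]
            if cid == head or commit["timestamp"] >= cutoff:
                alive.add(cid)
                stack.extend(commit["parents"])
    expired = set(commits) - alive
    return alive, expired

def collect_deletable_addresses(alive, expired, commits):
    # Only addresses referenced exclusively by expired commits are
    # eligible for deletion (step 3 above).
    live_addrs = set().union(*(commits[c]["addresses"] for c in alive)) if alive else set()
    dead_addrs = set().union(*(commits[c]["addresses"] for c in expired)) if expired else set()
    return dead_addrs - live_addrs
```

Note that an address shared between an expired and an alive commit survives: liveness is a property of addresses, not commits.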

The output of this phase is a self-contained dataset that the downstream Spark GC job can consume independently. This includes:

  • A commits CSV listing which commits are alive and which are expired
  • An addresses file listing the physical storage paths of objects to be deleted
  • A run ID that uniquely identifies this preparation run, used to correlate with the Spark job
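A minimal sketch of producing the commits CSV and run ID, using only the standard library. The two-column schema and the storage key layout shown here are assumptions for illustration, not the exact lakeFS schema:

```python
import csv
import io
import uuid

def write_gc_outputs(alive, expired):
    """Hypothetical serialization of the mark phase's commit results."""
    run_id = str(uuid.uuid4())  # correlates this preparation run with the Spark job
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["commit_id", "expired"])  # assumed header
    for cid in sorted(alive):
        writer.writerow([cid, "false"])
    for cid in sorted(expired):
        writer.writerow([cid, "true"])
    # Assumed well-known location on the repository's backing storage,
    # keyed by run ID so the Spark job can find it.
    key = f"_lakefs/retention/gc/commits/run_id={run_id}/commits.csv"
    return run_id, key, buf.getvalue()
```

In practice the addresses file would be written as Parquet alongside this CSV, under the same run-ID prefix.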

This phase is deliberately read-only with respect to the object store's data objects. It reads metadata (commits, references, manifests) but does not delete anything. Deletion is deferred to the Spark job.

Usage

Trigger metadata preparation when:

  • Running the GC pipeline on a schedule (the preparation step always precedes the Spark deletion job)
  • Performing a dry run to understand what would be deleted without actually deleting anything
  • Debugging GC behavior by inspecting the generated CSV/Parquet files
  • Testing retention rule changes before committing to a full GC run

Theoretical Basis

The separation of identification from execution in the GC pipeline follows the classic two-phase pattern used in distributed garbage collection systems:

Phase 1 (Mark): Identify live and dead objects by traversing the reference graph. In lakeFS, this is the metadata preparation step, which walks the commit DAG and determines liveness based on retention rules.

Phase 2 (Sweep): Delete the dead objects. In lakeFS, this is delegated to the Spark job, which performs bulk deletion against the object store.
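The sweep side can be sketched as a bulk deletion loop that tolerates re-runs. This is a hypothetical stand-in (a dict plays the role of the object store), but it illustrates why the two-phase split makes the sweep idempotent: objects already removed by a previous, partially failed run are simply skipped:

```python
def sweep(addresses, object_store):
    """Delete marked addresses; skip ones already gone so re-runs are safe.

    object_store: dict mapping physical address -> object bytes,
    standing in for the real storage backend.
    Returns (deleted_count, skipped_count).
    """
    deleted, skipped = 0, 0
    for addr in addresses:
        if object_store.pop(addr, None) is not None:
            deleted += 1
        else:
            skipped += 1  # already deleted by an earlier run with the same run ID
    return deleted, skipped
```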

This separation provides several advantages:

  • Scalability: The mark phase runs inside the lakeFS server (which has access to the commit graph), while the sweep phase runs in Spark (which can parallelize deletion across hundreds of executors).
  • Safety: The mark phase output can be inspected before the sweep runs, preventing accidental data loss.
  • Idempotency: If the Spark job fails partway through, it can be re-run with the same run ID without re-preparing metadata.
  • Auditability: The CSV/Parquet files serve as a permanent record of what was marked for deletion and why.

The use of CSV and Parquet as interchange formats is deliberate: these formats are natively supported by Apache Spark and can be easily inspected with standard data tools (e.g., DuckDB, pandas, AWS Athena).
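As a sketch of that inspectability, here is a standard-library-only summary of a commits CSV. The two-column schema assumed here matches the hypothetical layout above, not a documented lakeFS contract; with pandas or DuckDB the same check is a one-liner:

```python
import csv
import io

def summarize_commits_csv(csv_text):
    """Count alive vs. expired commits in a prepared commits CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    expired = sum(1 for r in rows if r["expired"] == "true")
    return {"total": len(rows), "expired": expired, "alive": len(rows) - expired}

sample = "commit_id,expired\nc1,true\nc2,false\nc3,true\n"
print(summarize_commits_csv(sample))  # {'total': 3, 'expired': 2, 'alive': 1}
```

Running such a summary before launching the Spark job is a cheap sanity check that the retention rules marked roughly the expected number of commits.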

Related Pages

Implemented By
