
Implementation:Treeverse LakeFS PrepareGarbageCollectionCommits

From Leeroopedia


Knowledge Sources
Domains Storage_Management, REST_API
Last Updated 2026-02-08 00:00 GMT

Overview

The prepareGarbageCollectionCommits API endpoint triggers the metadata-preparation phase of garbage collection. It generates structured files that identify expired commits and their associated physical addresses; the Spark GC job then consumes these files to perform the actual deletions.

Description

This endpoint initiates a server-side process that:

  1. Reads the repository's configured retention rules (set via setGCRules)
  2. Walks the commit graph for all branches
  3. Identifies commits that have exceeded their branch's retention window
  4. Writes CSV/Parquet output files to the repository's backing storage
  5. Returns a response containing the run ID and locations of the generated files

The response includes the paths to the generated metadata files on the repository's backing object storage (e.g., S3, GCS, Azure Blob). These paths are passed to the Spark GC job as input.

The endpoint requires no request body. All necessary information (retention rules, commit graph) is read from the repository's internal state.

Usage

Call this endpoint as the second step of the GC pipeline, after configuring retention rules and before launching the Spark GC job. The returned run_id must be passed to the Spark job so it can locate the correct metadata files.
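The call itself can be sketched with Python's standard library. The path and basic-auth scheme follow the specification on this page; the helper names (`prepare_commits_url`, `prepare_gc_commits`) are illustrative, not part of any SDK:

```python
import base64
import json
import urllib.request

def prepare_commits_url(base_url: str, repository: str) -> str:
    # Path per the API specification: /repositories/{repository}/gc/prepare_commits
    return f"{base_url}/repositories/{repository}/gc/prepare_commits"

def prepare_gc_commits(base_url, repository, access_key, secret_key):
    """POST with basic auth and no request body; returns the parsed JSON response."""
    req = urllib.request.Request(
        prepare_commits_url(base_url, repository), method="POST"
    )
    token = base64.b64encode(f"{access_key}:{secret_key}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (assumes a lakeFS server at localhost:8000):
# result = prepare_gc_commits("http://localhost:8000/api/v1", "my-repo",
#                             "ACCESS_KEY", "SECRET_KEY")
# run_id = result["run_id"]  # hand this to the Spark GC job
```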

Code Reference

Source Location

  • API specification: api/swagger.yml lines 6527-6550
  • Operation ID: prepareGarbageCollectionCommits
  • HTTP method: POST
  • Path: /api/v1/repositories/{repository}/gc/prepare_commits

Signature

# Response Schema: GarbageCollectionPrepareResponse
GarbageCollectionPrepareResponse:
  type: object
  required:
    - run_id
    - gc_commits_location
    - gc_addresses_location
  properties:
    run_id:
      type: string
      description: >
        Unique identifier for this GC preparation run.
        Passed to the Spark GC job to correlate preparation with execution.
    gc_commits_location:
      type: string
      description: >
        S3/GCS/Azure path to the CSV file containing commit liveness data.
    gc_addresses_location:
      type: string
      description: >
        S3/GCS/Azure path to the Parquet file containing expired physical addresses.
    gc_commits_presigned_url:
      type: string
      description: >
        Optional presigned URL for downloading the commits CSV directly.
        Only populated when the server is configured to generate presigned URLs.

Import

# No SDK import required — this is a REST API call
curl -X POST http://localhost:8000/api/v1/repositories/{repository}/gc/prepare_commits \
  -u "access_key:secret_key"

I/O Contract

Inputs

Parameter  | Location | Type   | Required | Description
repository | Path     | string | Yes      | The repository name to prepare GC metadata for

No request body is required. The endpoint reads retention rules and commit graph data from the repository's internal state.

Outputs

Status Code | Body                             | Description
201         | GarbageCollectionPrepareResponse | Metadata preparation completed successfully; response contains run_id and file locations
401         | Error                            | Unauthorized: invalid or missing credentials
404         | Error                            | Repository not found
409         | Error                            | Conflict: another GC preparation is already in progress
420         | Error                            | Too many requests: GC preparation rate limit exceeded
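The 409 and 420 codes are transient, so a client may retry with backoff. A minimal sketch, where `do_prepare` is a caller-supplied zero-argument function returning (status_code, body) — the wrapper itself is hypothetical, not part of any lakeFS SDK:

```python
import time

RETRYABLE = {409, 420}  # conflict / rate-limited, per the status table above

def call_with_retry(do_prepare, max_attempts=5, base_delay=1.0):
    """Retry the prepare call on 409/420 with exponential backoff."""
    for attempt in range(max_attempts):
        status, body = do_prepare()
        if status == 201:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"GC preparation failed with status {status}")
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("GC preparation still conflicting after retries")
```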

Output File Formats

File                  | Format  | Contents
gc_commits_location   | CSV     | Columns: commit_id, branch, is_alive, commit_date, retention_days
gc_addresses_location | Parquet | Columns: physical_address, commit_id, path
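Given the commits-CSV column layout above, downstream tooling can inspect which commits are expired. A sketch using the standard library; the sample rows are illustrative, not real output:

```python
import csv
import io

# Illustrative sample; real files are written to the repository's backing
# storage and are typically much larger.
SAMPLE_COMMITS_CSV = """\
commit_id,branch,is_alive,commit_date,retention_days
c1a2b3,main,true,2026-01-15T10:00:00Z,21
d4e5f6,dev,false,2025-12-01T08:30:00Z,7
"""

def expired_commits(csv_text: str):
    """Return commit IDs whose is_alive flag is false (deletion candidates)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["commit_id"] for row in reader if row["is_alive"] == "false"]
```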

Usage Examples

Basic Preparation Call

# Trigger GC metadata preparation
curl -X POST http://localhost:8000/api/v1/repositories/my-repo/gc/prepare_commits \
  -u "AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

# Example response:
# {
#   "run_id": "gc_run_20260208_001",
#   "gc_commits_location": "s3://my-repo-storage/gc/commits/gc_run_20260208_001.csv",
#   "gc_addresses_location": "s3://my-repo-storage/gc/addresses/gc_run_20260208_001.parquet",
#   "gc_commits_presigned_url": "https://my-repo-storage.s3.amazonaws.com/gc/commits/..."
# }

Capture Run ID for Spark Job

# Capture the run_id for use with the Spark GC job
RUN_ID=$(curl -s -X POST \
  http://localhost:8000/api/v1/repositories/my-repo/gc/prepare_commits \
  -u "AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  | jq -r '.run_id')

echo "GC Run ID: $RUN_ID"

# Now pass $RUN_ID to the Spark GC job
docker run --rm treeverse/bitnami-spark:3.3 spark-submit \
  --class io.treeverse.gc.GarbageCollection \
  --conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=AKIAIOSFODNN7EXAMPLE \
  --conf spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
  /opt/metaclient/client.jar my-repo "$RUN_ID"

Python SDK Example

import lakefs_sdk

configuration = lakefs_sdk.Configuration(
    host="http://localhost:8000/api/v1",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)

with lakefs_sdk.ApiClient(configuration) as api_client:
    api = lakefs_sdk.RetentionApi(api_client)
    response = api.prepare_garbage_collection_commits("my-repo")
    print(f"Run ID:             {response.run_id}")
    print(f"Commits location:   {response.gc_commits_location}")
    print(f"Addresses location: {response.gc_addresses_location}")
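When the server populates the optional gc_commits_presigned_url field, the commits CSV can be downloaded directly without object-store credentials. A minimal sketch (the helper name is hypothetical):

```python
import urllib.request
from typing import Optional

def fetch_commits_csv(response: dict) -> Optional[str]:
    """Download the commits CSV via the optional presigned URL, if present."""
    url = response.get("gc_commits_presigned_url")
    if not url:
        return None  # server not configured to generate presigned URLs
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode()
```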

Related Pages

Implements Principle
