Implementation: Treeverse lakeFS PrepareGarbageCollectionCommits
| Knowledge Sources | |
|---|---|
| Domains | Storage_Management, REST_API |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
The prepareGarbageCollectionCommits API endpoint triggers the metadata preparation phase of garbage collection, generating structured files that identify expired commits and their associated physical addresses for subsequent deletion by the Spark GC job.
Description
This endpoint initiates a server-side process that:
- Reads the repository's configured retention rules (set via setGCRules)
- Walks the commit graph for all branches
- Identifies commits that have exceeded their branch's retention window
- Writes CSV/Parquet output files to the repository's backing storage
- Returns a response containing the run ID and locations of the generated files
The response includes the paths to the generated metadata files on the repository's backing object storage (e.g., S3, GCS, Azure Blob). These paths are passed to the Spark GC job as input.
The endpoint requires no request body. All necessary information (retention rules, commit graph) is read from the repository's internal state.
Usage
Call this endpoint as the second step of the GC pipeline, after configuring retention rules and before launching the Spark GC job. The returned run_id must be passed to the Spark job so it can locate the correct metadata files.
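In scripted pipelines, the preparation call and the run_id capture are typically a single step. The following is a minimal sketch of that step using Python's requests library (the library choice and variable names are assumptions; the endpoint path, credentials, and response fields come from the examples and schema on this page):
# Minimal sketch: trigger GC preparation and capture the run_id.
# Assumes the requests library and the local lakeFS endpoint used elsewhere on this page.
import requests

LAKEFS_URL = "http://localhost:8000/api/v1"
AUTH = ("AKIAIOSFODNN7EXAMPLE", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")

# POST with no request body; basic auth with the access key / secret key pair
resp = requests.post(f"{LAKEFS_URL}/repositories/my-repo/gc/prepare_commits", auth=AUTH)
resp.raise_for_status()  # a successful preparation returns 201
prepare = resp.json()

run_id = prepare["run_id"]  # hand this to the Spark GC job
print("GC run:", run_id, "commits file:", prepare["gc_commits_location"])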
Code Reference
Source Location
- API specification: api/swagger.yml, lines 6527-6550
- Operation ID: prepareGarbageCollectionCommits
- HTTP method: POST
- Path: /api/v1/repositories/{repository}/gc/prepare_commits
Signature
# Response Schema: GarbageCollectionPrepareResponse
GarbageCollectionPrepareResponse:
  type: object
  required:
    - run_id
    - gc_commits_location
    - gc_addresses_location
  properties:
    run_id:
      type: string
      description: >
        Unique identifier for this GC preparation run.
        Passed to the Spark GC job to correlate preparation with execution.
    gc_commits_location:
      type: string
      description: >
        S3/GCS/Azure path to the CSV file containing commit liveness data.
    gc_addresses_location:
      type: string
      description: >
        S3/GCS/Azure path to the Parquet file containing expired physical addresses.
    gc_commits_presigned_url:
      type: string
      description: >
        Optional presigned URL for downloading the commits CSV directly.
        Only populated when the server is configured to generate presigned URLs.
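When the response is consumed outside the generated SDK, mirroring this schema in a small client-side model keeps the required/optional split explicit. The dataclass below is purely illustrative (it is not part of lakefs_sdk); its field names follow the schema above:
# Hypothetical client-side model mirroring GarbageCollectionPrepareResponse.
# Field names come from the schema above; the class itself is illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GCPrepareResponse:
    run_id: str                                      # required: correlates prepare and Spark runs
    gc_commits_location: str                         # required: CSV with commit liveness data
    gc_addresses_location: str                       # required: Parquet with expired physical addresses
    gc_commits_presigned_url: Optional[str] = None   # optional: set only when presigning is enabled

    @classmethod
    def from_json(cls, body: dict) -> "GCPrepareResponse":
        return cls(
            run_id=body["run_id"],
            gc_commits_location=body["gc_commits_location"],
            gc_addresses_location=body["gc_addresses_location"],
            gc_commits_presigned_url=body.get("gc_commits_presigned_url"),
        )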
Import
# No SDK import required — this is a REST API call
curl -X POST http://localhost:8000/api/v1/repositories/{repository}/gc/prepare_commits \
-u "access_key:secret_key"
I/O Contract
Inputs
| Parameter | Location | Type | Required | Description |
|---|---|---|---|---|
| repository | Path | string | Yes | The repository name to prepare GC metadata for |
No request body is required. The endpoint reads retention rules and commit graph data from the repository's internal state.
Outputs
| Status Code | Body | Description |
|---|---|---|
| 201 | GarbageCollectionPrepareResponse | Metadata preparation completed successfully; response contains run_id and file locations |
| 401 | Error | Unauthorized: invalid or missing credentials |
| 404 | Error | Repository not found |
| 409 | Error | Conflict: another GC preparation is already in progress |
| 420 | Error | Too many requests: GC preparation rate limit exceeded |
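Pipelines that automate GC usually treat 409 and 420 as retriable and everything else as fatal. The sketch below shows one way to do that with lakefs_sdk; the ApiException import and its status attribute are assumptions based on typical OpenAPI-generated clients, so verify them against your SDK version:
# Sketch of status-code handling around the preparation call.
# Assumes lakefs_sdk raises ApiException with a .status attribute on non-2xx
# responses (typical for OpenAPI-generated clients); verify for your version.
import time
import lakefs_sdk
from lakefs_sdk.exceptions import ApiException

def prepare_with_retry(api: lakefs_sdk.RetentionApi, repo: str, attempts: int = 5):
    for _ in range(attempts):
        try:
            return api.prepare_garbage_collection_commits(repo)  # 201 -> response object
        except ApiException as e:
            if e.status in (409, 420):  # another preparation in progress / rate limited
                time.sleep(30)          # back off before retrying
                continue
            raise                       # 401, 404 and other errors are not retriable
    raise RuntimeError("GC preparation did not succeed within the retry budget")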
Output File Formats
| File | Format | Contents |
|---|---|---|
| gc_commits_location | CSV | Columns: commit_id, branch, is_alive, commit_date, retention_days |
| gc_addresses_location | Parquet | Columns: physical_address, commit_id, path |
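The generated files are consumed by the Spark GC job, but they can also be inspected directly, for example to audit what a run is about to delete. A rough sketch, assuming pandas (with s3fs and pyarrow installed) can read from the repository's backing bucket and that the columns match the table above:
# Sketch: inspect the generated GC metadata files directly.
# Assumes pandas with s3fs/pyarrow and read access to the repository's storage bucket;
# the example paths are the ones shown in the response example below.
import pandas as pd

commits = pd.read_csv("s3://my-repo-storage/gc/commits/gc_run_20260208_001.csv")
addresses = pd.read_parquet("s3://my-repo-storage/gc/addresses/gc_run_20260208_001.parquet")

print(commits["is_alive"].value_counts())               # live vs. expired commits
print(addresses[["physical_address", "path"]].head())   # sample of addresses slated for deletion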
Usage Examples
Basic Preparation Call
# Trigger GC metadata preparation
curl -X POST http://localhost:8000/api/v1/repositories/my-repo/gc/prepare_commits \
-u "AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
# Example response:
# {
# "run_id": "gc_run_20260208_001",
# "gc_commits_location": "s3://my-repo-storage/gc/commits/gc_run_20260208_001.csv",
# "gc_addresses_location": "s3://my-repo-storage/gc/addresses/gc_run_20260208_001.parquet",
# "gc_commits_presigned_url": "https://my-repo-storage.s3.amazonaws.com/gc/commits/..."
# }
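When gc_commits_presigned_url is present, the commits CSV can be fetched over HTTPS without credentials for the backing bucket. A hypothetical helper (the function name and the requests/csv usage are assumptions, not part of any SDK):
# Hypothetical helper: download the commits CSV through the optional presigned URL.
# Only applicable when the lakeFS server is configured to generate presigned URLs.
import csv
import io
import requests

def download_commits_csv(prepare_response: dict) -> list:
    url = prepare_response.get("gc_commits_presigned_url")
    if not url:
        raise ValueError("server did not return a presigned URL for the commits CSV")
    text = requests.get(url).text
    # Column names (commit_id, branch, is_alive, ...) per the file format table above
    return list(csv.DictReader(io.StringIO(text)))
Pass the parsed JSON body of the preparation response (such as the example above) to the helper.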
Capture Run ID for Spark Job
# Capture the run_id for use with the Spark GC job
RUN_ID=$(curl -s -X POST \
http://localhost:8000/api/v1/repositories/my-repo/gc/prepare_commits \
-u "AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
| jq -r '.run_id')
echo "GC Run ID: $RUN_ID"
# Now pass $RUN_ID to the Spark GC job
docker run --rm treeverse/bitnami-spark:3.3 spark-submit \
--class io.treeverse.gc.GarbageCollection \
--conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
--conf spark.hadoop.lakefs.api.access_key=AKIAIOSFODNN7EXAMPLE \
--conf spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
/opt/metaclient/client.jar my-repo "$RUN_ID"
Python SDK Example
import lakefs_sdk

# Configure the client against the lakeFS endpoint with access/secret key credentials
configuration = lakefs_sdk.Configuration(
    host="http://localhost:8000/api/v1",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)

with lakefs_sdk.ApiClient(configuration) as api_client:
    api = lakefs_sdk.RetentionApi(api_client)
    # Trigger metadata preparation and print the generated run ID and file locations
    response = api.prepare_garbage_collection_commits("my-repo")
    print(f"Run ID: {response.run_id}")
    print(f"Commits location: {response.gc_commits_location}")
    print(f"Addresses location: {response.gc_addresses_location}")