
Implementation:Treeverse LakeFS PrepareGarbageCollectionCommits

From Leeroopedia


Knowledge Sources
Domains Storage_Management, REST_API
Last Updated 2026-02-08 00:00 GMT

Overview

The prepareGarbageCollectionCommits API endpoint triggers the metadata-preparation phase of garbage collection. It generates structured files that identify expired commits and their associated physical addresses; the Spark GC job then consumes these files to perform the actual deletions.

Description

This endpoint initiates a server-side process that:

  1. Reads the repository's configured retention rules (set via setGCRules)
  2. Walks the commit graph for all branches
  3. Identifies commits that have exceeded their branch's retention window
  4. Writes CSV/Parquet output files to the repository's backing storage
  5. Returns a response containing the run ID and locations of the generated files

The response includes the paths to the generated metadata files on the repository's backing object storage (e.g., S3, GCS, Azure Blob). These paths are passed to the Spark GC job as input.

The endpoint requires no request body. All necessary information (retention rules, commit graph) is read from the repository's internal state.

Usage

Call this endpoint as the second step of the GC pipeline, after configuring retention rules and before launching the Spark GC job. The returned run_id must be passed to the Spark job so it can locate the correct metadata files.
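The call itself can be sketched with Python's standard library. The path and basic-auth scheme follow the specification on this page; the helper names (`prepare_commits_url`, `prepare_gc_commits`) are illustrative, not part of any SDK:

```python
import base64
import json
import urllib.request

def prepare_commits_url(base_url: str, repository: str) -> str:
    # Path per the API specification: /repositories/{repository}/gc/prepare_commits
    return f"{base_url}/repositories/{repository}/gc/prepare_commits"

def prepare_gc_commits(base_url, repository, access_key, secret_key):
    """POST with basic auth and no request body; returns the parsed JSON response."""
    req = urllib.request.Request(
        prepare_commits_url(base_url, repository), method="POST"
    )
    token = base64.b64encode(f"{access_key}:{secret_key}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (assumes a lakeFS server at localhost:8000):
# result = prepare_gc_commits("http://localhost:8000/api/v1", "my-repo",
#                             "ACCESS_KEY", "SECRET_KEY")
# run_id = result["run_id"]  # hand this to the Spark GC job
```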

Code Reference

Source Location

  • API specification: api/swagger.yml lines 6527-6550
  • Operation ID: prepareGarbageCollectionCommits
  • HTTP method: POST
  • Path: /api/v1/repositories/{repository}/gc/prepare_commits

Signature

# Response Schema: GarbageCollectionPrepareResponse
GarbageCollectionPrepareResponse:
  type: object
  required:
    - run_id
    - gc_commits_location
    - gc_addresses_location
  properties:
    run_id:
      type: string
      description: >
        Unique identifier for this GC preparation run.
        Passed to the Spark GC job to correlate preparation with execution.
    gc_commits_location:
      type: string
      description: >
        S3/GCS/Azure path to the CSV file containing commit liveness data.
    gc_addresses_location:
      type: string
      description: >
        S3/GCS/Azure path to the Parquet file containing expired physical addresses.
    gc_commits_presigned_url:
      type: string
      description: >
        Optional presigned URL for downloading the commits CSV directly.
        Only populated when the server is configured to generate presigned URLs.

Import

# No SDK import required — this is a REST API call
curl -X POST http://localhost:8000/api/v1/repositories/{repository}/gc/prepare_commits \
  -u "access_key:secret_key"

I/O Contract

Inputs

Parameter  | Location | Type   | Required | Description
repository | Path     | string | Yes      | The repository name to prepare GC metadata for

No request body is required. The endpoint reads retention rules and commit graph data from the repository's internal state.

Outputs

Status Code | Body                             | Description
201         | GarbageCollectionPrepareResponse | Metadata preparation completed successfully; response contains run_id and file locations
401         | Error                            | Unauthorized: invalid or missing credentials
404         | Error                            | Repository not found
409         | Error                            | Conflict: another GC preparation is already in progress
420         | Error                            | Too many requests: GC preparation rate limit exceeded
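The 409 and 420 codes are transient, so a client may retry with backoff. A minimal sketch, where `do_prepare` is a caller-supplied zero-argument function returning (status_code, body) — the wrapper itself is hypothetical, not part of any lakeFS SDK:

```python
import time

RETRYABLE = {409, 420}  # conflict / rate-limited, per the status table above

def call_with_retry(do_prepare, max_attempts=5, base_delay=1.0):
    """Retry the prepare call on 409/420 with exponential backoff."""
    for attempt in range(max_attempts):
        status, body = do_prepare()
        if status == 201:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"GC preparation failed with status {status}")
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("GC preparation still conflicting after retries")
```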

Output File Formats

File                  | Format  | Contents
gc_commits_location   | CSV     | Columns: commit_id, branch, is_alive, commit_date, retention_days
gc_addresses_location | Parquet | Columns: physical_address, commit_id, path
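Given the commits-CSV column layout above, downstream tooling can inspect which commits are expired. A sketch using the standard library; the sample rows are illustrative, not real output:

```python
import csv
import io

# Illustrative sample; real files are written to the repository's backing
# storage and are typically much larger.
SAMPLE_COMMITS_CSV = """\
commit_id,branch,is_alive,commit_date,retention_days
c1a2b3,main,true,2026-01-15T10:00:00Z,21
d4e5f6,dev,false,2025-12-01T08:30:00Z,7
"""

def expired_commits(csv_text: str):
    """Return commit IDs whose is_alive flag is false (deletion candidates)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["commit_id"] for row in reader if row["is_alive"] == "false"]
```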

Usage Examples

Basic Preparation Call

# Trigger GC metadata preparation
curl -X POST http://localhost:8000/api/v1/repositories/my-repo/gc/prepare_commits \
  -u "AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

# Example response:
# {
#   "run_id": "gc_run_20260208_001",
#   "gc_commits_location": "s3://my-repo-storage/gc/commits/gc_run_20260208_001.csv",
#   "gc_addresses_location": "s3://my-repo-storage/gc/addresses/gc_run_20260208_001.parquet",
#   "gc_commits_presigned_url": "https://my-repo-storage.s3.amazonaws.com/gc/commits/..."
# }

Capture Run ID for Spark Job

# Capture the run_id for use with the Spark GC job
RUN_ID=$(curl -s -X POST \
  http://localhost:8000/api/v1/repositories/my-repo/gc/prepare_commits \
  -u "AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  | jq -r '.run_id')

echo "GC Run ID: $RUN_ID"

# Now pass $RUN_ID to the Spark GC job
docker run --rm treeverse/bitnami-spark:3.3 spark-submit \
  --class io.treeverse.gc.GarbageCollection \
  --conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=AKIAIOSFODNN7EXAMPLE \
  --conf spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
  /opt/metaclient/client.jar my-repo "$RUN_ID"

Python SDK Example

import lakefs_sdk

configuration = lakefs_sdk.Configuration(
    host="http://localhost:8000/api/v1",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)

with lakefs_sdk.ApiClient(configuration) as api_client:
    api = lakefs_sdk.RetentionApi(api_client)
    response = api.prepare_garbage_collection_commits("my-repo")
    print(f"Run ID:             {response.run_id}")
    print(f"Commits location:   {response.gc_commits_location}")
    print(f"Addresses location: {response.gc_addresses_location}")
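When the server populates the optional gc_commits_presigned_url field, the commits CSV can be downloaded directly without object-store credentials. A minimal sketch (the helper name is hypothetical):

```python
import urllib.request
from typing import Optional

def fetch_commits_csv(response: dict) -> Optional[str]:
    """Download the commits CSV via the optional presigned URL, if present."""
    url = response.get("gc_commits_presigned_url")
    if not url:
        return None  # server not configured to generate presigned URLs
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode()
```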

Related Pages

Implements Principle
