Implementation:Treeverse LakeFS RunSparkSubmit

Knowledge Sources	lakeFS lakeFS GC Docs
Domains	Storage_Management, REST_API
Last Updated	2026-02-08 00:00 GMT

Overview

RunSparkSubmit is an external tool function that executes the lakeFS garbage collection Spark job inside a Docker container, performing distributed deletion of expired objects from the underlying object storage.

Description

This function orchestrates the execution of a Spark job by:

Building the appropriate docker run command with the correct image, network settings, and volume mounts
Constructing the spark-submit arguments including the GC entry point class, Spark master URL, and lakeFS API configuration
Launching the container and streaming its output for monitoring
Returning the exit status to the caller

The function is defined in the lakeFS end-to-end test suite (esti) and relies on two helper functions: getSparkSubmitArgs (which builds the Spark submission arguments) and getDockerArgs (which builds the Docker execution arguments).

Usage

Use this function (or replicate its behavior) when:

Running GC as part of an automated pipeline after the metadata preparation step
Executing GC in CI/CD or end-to-end test environments
Deploying GC on managed Spark services (EMR, Dataproc) with equivalent configuration

Code Reference

Source Location

Primary function: esti/gc_test_utils.go lines 107-124 (RunSparkSubmit)
Helper — Spark args: esti/gc_test_utils.go lines 15-24 (getSparkSubmitArgs)
Helper — Docker args: esti/gc_test_utils.go lines 26-35 (getDockerArgs)

Signature

// SparkSubmitConfig holds all parameters for a Spark GC job submission.
type SparkSubmitConfig struct {
    SparkVersion    string      // Spark Docker image version tag (e.g., "3.3")
    LocalJar        string      // Path to the lakefs-spark-client JAR inside the container
    EntryPoint      string      // Main class (e.g., "io.treeverse.gc.GarbageCollection")
    ExtraSubmitArgs []string    // Additional spark-submit flags (e.g., "--conf ...")
    ProgramArgs     []string    // Arguments passed to the main class (e.g., repo, run_id)
    LogSource       string      // Logging identifier for output tagging
}

// RunSparkSubmit executes the Spark GC job in a Docker container.
func RunSparkSubmit(config SparkSubmitConfig) error

// getSparkSubmitArgs builds the spark-submit command-line arguments.
func getSparkSubmitArgs(config SparkSubmitConfig) []string

// getDockerArgs builds the docker run command-line arguments.
func getDockerArgs(config SparkSubmitConfig) []string

Import

# No import needed — this is executed via Docker + spark-submit
# The Docker image contains Spark and the lakeFS GC JAR

# Pull the image
docker pull treeverse/bitnami-spark:3.3

I/O Contract

Inputs

Parameter	Type	Required	Description
`SparkVersion`	string	Yes	Docker image tag for the Spark image (e.g., `"3.3"`)
`LocalJar`	string	Yes	Path to the `lakefs-spark-client` JAR file inside the container (e.g., `/opt/metaclient/client.jar`)
`EntryPoint`	string	Yes	Fully qualified Java class name (e.g., `io.treeverse.gc.GarbageCollection`)
`ExtraSubmitArgs`	[]string	No	Additional `spark-submit` flags (e.g., `--conf spark.executor.memory=4g`)
`ProgramArgs`	[]string	Yes	Arguments to the main class: typically `[repository, run_id]`
`LogSource`	string	No	Label for log output identification

Spark Configuration Properties (passed via --conf):

Property	Description
`spark.hadoop.lakefs.api.url`	lakeFS API endpoint (e.g., `http://localhost:8000/api/v1`)
`spark.hadoop.lakefs.api.access_key`	lakeFS access key ID
`spark.hadoop.lakefs.api.secret_key`	lakeFS secret access key
`spark.master`	Spark master URL (e.g., `spark://localhost:7077` or `local[*]`)

Outputs

Output	Description
Exit code 0	GC Spark job completed successfully; all expired objects were deleted
Exit code non-zero	Job failed; check container logs for details
Container stdout/stderr	Spark job output including deletion counts and any errors

Usage Examples

Basic Spark GC Execution

# Run the GC Spark job after obtaining a run_id from prepareGarbageCollectionCommits
docker run --rm \
  --network host \
  treeverse/bitnami-spark:3.3 \
  spark-submit \
    --class io.treeverse.gc.GarbageCollection \
    --master spark://localhost:7077 \
    --conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
    --conf spark.hadoop.lakefs.api.access_key=AKIAIOSFODNN7EXAMPLE \
    --conf spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
    /opt/metaclient/client.jar \
    my-repo \
    gc_run_20260208_001

Local Spark Mode (Single Machine)

# For development/testing — use local Spark master
docker run --rm \
  --network host \
  treeverse/bitnami-spark:3.3 \
  spark-submit \
    --class io.treeverse.gc.GarbageCollection \
    --master "local[*]" \
    --conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
    --conf spark.hadoop.lakefs.api.access_key=AKIAIOSFODNN7EXAMPLE \
    --conf spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
    --conf spark.driver.memory=4g \
    /opt/metaclient/client.jar \
    my-repo \
    gc_run_20260208_001

AWS EMR Execution

# Submit GC job to an existing EMR cluster
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="lakeFS GC",\
ActionOnFailure=CONTINUE,\
Args=[--class,io.treeverse.gc.GarbageCollection,\
--conf,spark.hadoop.lakefs.api.url=http://lakefs-host:8000/api/v1,\
--conf,spark.hadoop.lakefs.api.access_key=AKIAIOSFODNN7EXAMPLE,\
--conf,spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY,\
s3://my-bucket/jars/lakefs-spark-client.jar,\
my-repo,gc_run_20260208_001]

Related Pages

Implements Principle

Principle:Treeverse_LakeFS_GC_Job_Execution

Requires Environment

Environment:Treeverse_LakeFS_Spark_GC_Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment