Implementation:Treeverse LakeFS RunSparkSubmit
| Knowledge Sources | |
|---|---|
| Domains | Storage_Management, REST_API |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
RunSparkSubmit is an external tool function that executes the lakeFS garbage collection Spark job inside a Docker container, performing distributed deletion of expired objects from the underlying object storage.
Description
This function orchestrates the execution of a Spark job by:
- Building the appropriate
docker runcommand with the correct image, network settings, and volume mounts - Constructing the
spark-submitarguments including the GC entry point class, Spark master URL, and lakeFS API configuration - Launching the container and streaming its output for monitoring
- Returning the exit status to the caller
The function is defined in the lakeFS end-to-end test suite (esti) and relies on two helper functions: getSparkSubmitArgs (which builds the Spark submission arguments) and getDockerArgs (which builds the Docker execution arguments).
Usage
Use this function (or replicate its behavior) when:
- Running GC as part of an automated pipeline after the metadata preparation step
- Executing GC in CI/CD or end-to-end test environments
- Deploying GC on managed Spark services (EMR, Dataproc) with equivalent configuration
Code Reference
Source Location
- Primary function:
esti/gc_test_utils.golines 107-124 (RunSparkSubmit) - Helper — Spark args:
esti/gc_test_utils.golines 15-24 (getSparkSubmitArgs) - Helper — Docker args:
esti/gc_test_utils.golines 26-35 (getDockerArgs)
Signature
// SparkSubmitConfig holds all parameters for a Spark GC job submission.
type SparkSubmitConfig struct {
SparkVersion string // Spark Docker image version tag (e.g., "3.3")
LocalJar string // Path to the lakefs-spark-client JAR inside the container
EntryPoint string // Main class (e.g., "io.treeverse.gc.GarbageCollection")
ExtraSubmitArgs []string // Additional spark-submit flags (e.g., "--conf ...")
ProgramArgs []string // Arguments passed to the main class (e.g., repo, run_id)
LogSource string // Logging identifier for output tagging
}
// RunSparkSubmit executes the Spark GC job in a Docker container.
func RunSparkSubmit(config SparkSubmitConfig) error
// getSparkSubmitArgs builds the spark-submit command-line arguments.
func getSparkSubmitArgs(config SparkSubmitConfig) []string
// getDockerArgs builds the docker run command-line arguments.
func getDockerArgs(config SparkSubmitConfig) []string
Import
# No import needed — this is executed via Docker + spark-submit
# The Docker image contains Spark and the lakeFS GC JAR
# Pull the image
docker pull treeverse/bitnami-spark:3.3
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
SparkVersion |
string | Yes | Docker image tag for the Spark image (e.g., "3.3")
|
LocalJar |
string | Yes | Path to the lakefs-spark-client JAR file inside the container (e.g., /opt/metaclient/client.jar)
|
EntryPoint |
string | Yes | Fully qualified Java class name (e.g., io.treeverse.gc.GarbageCollection)
|
ExtraSubmitArgs |
[]string | No | Additional spark-submit flags (e.g., --conf spark.executor.memory=4g)
|
ProgramArgs |
[]string | Yes | Arguments to the main class: typically [repository, run_id]
|
LogSource |
string | No | Label for log output identification |
Spark Configuration Properties (passed via --conf):
| Property | Description |
|---|---|
spark.hadoop.lakefs.api.url |
lakeFS API endpoint (e.g., http://localhost:8000/api/v1)
|
spark.hadoop.lakefs.api.access_key |
lakeFS access key ID |
spark.hadoop.lakefs.api.secret_key |
lakeFS secret access key |
spark.master |
Spark master URL (e.g., spark://localhost:7077 or local[*])
|
Outputs
| Output | Description |
|---|---|
| Exit code 0 | GC Spark job completed successfully; all expired objects were deleted |
| Exit code non-zero | Job failed; check container logs for details |
| Container stdout/stderr | Spark job output including deletion counts and any errors |
Usage Examples
Basic Spark GC Execution
# Run the GC Spark job after obtaining a run_id from prepareGarbageCollectionCommits
docker run --rm \
--network host \
treeverse/bitnami-spark:3.3 \
spark-submit \
--class io.treeverse.gc.GarbageCollection \
--master spark://localhost:7077 \
--conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
--conf spark.hadoop.lakefs.api.access_key=AKIAIOSFODNN7EXAMPLE \
--conf spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
/opt/metaclient/client.jar \
my-repo \
gc_run_20260208_001
Local Spark Mode (Single Machine)
# For development/testing — use local Spark master
docker run --rm \
--network host \
treeverse/bitnami-spark:3.3 \
spark-submit \
--class io.treeverse.gc.GarbageCollection \
--master "local[*]" \
--conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
--conf spark.hadoop.lakefs.api.access_key=AKIAIOSFODNN7EXAMPLE \
--conf spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
--conf spark.driver.memory=4g \
/opt/metaclient/client.jar \
my-repo \
gc_run_20260208_001
AWS EMR Execution
# Submit GC job to an existing EMR cluster
aws emr add-steps \
--cluster-id j-XXXXXXXXXXXXX \
--steps Type=Spark,Name="lakeFS GC",\
ActionOnFailure=CONTINUE,\
Args=[--class,io.treeverse.gc.GarbageCollection,\
--conf,spark.hadoop.lakefs.api.url=http://lakefs-host:8000/api/v1,\
--conf,spark.hadoop.lakefs.api.access_key=AKIAIOSFODNN7EXAMPLE,\
--conf,spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY,\
s3://my-bucket/jars/lakefs-spark-client.jar,\
my-repo,gc_run_20260208_001]