Implementation:Treeverse LakeFS RunSparkSubmit

From Leeroopedia


Knowledge Sources
Domains Storage_Management, REST_API
Last Updated 2026-02-08 00:00 GMT

Overview

RunSparkSubmit is a helper function from the lakeFS end-to-end test suite (esti) that runs the lakeFS garbage collection (GC) Spark job inside a Docker container. The job performs distributed deletion of expired objects from the underlying object storage.

Description

This function orchestrates the execution of a Spark job by:

  1. Building the appropriate docker run command with the correct image, network settings, and volume mounts
  2. Constructing the spark-submit arguments including the GC entry point class, Spark master URL, and lakeFS API configuration
  3. Launching the container and streaming its output for monitoring
  4. Returning the exit status to the caller

The function is defined in the lakeFS end-to-end test suite (esti) and relies on two helper functions: getSparkSubmitArgs (which builds the Spark submission arguments) and getDockerArgs (which builds the Docker execution arguments).
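The steps above can be sketched roughly as follows. This is a minimal illustration, not the esti implementation: sparkSubmitArgs, dockerArgs, and buildCommand are hypothetical names, the image name and host networking are copied from the usage examples on this page, and volume mounts are left out because the source does not state the mount paths.

```go
package main

import (
	"fmt"
	"strings"
)

// SparkSubmitConfig mirrors the struct documented on this page.
type SparkSubmitConfig struct {
	SparkVersion    string
	LocalJar        string
	EntryPoint      string
	ExtraSubmitArgs []string
	ProgramArgs     []string
	LogSource       string
}

// sparkSubmitArgs is a hypothetical stand-in for getSparkSubmitArgs.
// Order matters to spark-submit: flags first, then the JAR, then program args.
func sparkSubmitArgs(cfg SparkSubmitConfig) []string {
	args := []string{"spark-submit", "--class", cfg.EntryPoint}
	args = append(args, cfg.ExtraSubmitArgs...)
	args = append(args, cfg.LocalJar)
	return append(args, cfg.ProgramArgs...)
}

// dockerArgs is a hypothetical stand-in for getDockerArgs.
// Assumption: image name and --network host follow this page's examples.
func dockerArgs(cfg SparkSubmitConfig) []string {
	return []string{
		"run", "--rm",
		"--network", "host",
		"treeverse/bitnami-spark:" + cfg.SparkVersion,
	}
}

// buildCommand joins both halves into one argv, starting with "docker".
func buildCommand(cfg SparkSubmitConfig) []string {
	cmd := append([]string{"docker"}, dockerArgs(cfg)...)
	return append(cmd, sparkSubmitArgs(cfg)...)
}

func main() {
	cfg := SparkSubmitConfig{
		SparkVersion:    "3.3",
		LocalJar:        "/opt/metaclient/client.jar",
		EntryPoint:      "io.treeverse.gc.GarbageCollection",
		ExtraSubmitArgs: []string{"--master", "local[*]"},
		ProgramArgs:     []string{"my-repo", "gc_run_20260208_001"},
	}
	// Print the command instead of running it, so the sketch has no side effects.
	fmt.Println(strings.Join(buildCommand(cfg), " "))
}
```

Keeping the two builders separate means the spark-submit half can be reused unchanged when the job is submitted somewhere other than Docker.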

Usage

Use this function (or replicate its behavior) when:

  • Running GC as part of an automated pipeline after the metadata preparation step
  • Executing GC in CI/CD or end-to-end test environments
  • Deploying GC on managed Spark services (EMR, Dataproc) with equivalent configuration

Code Reference

Source Location

  • Primary function: esti/gc_test_utils.go lines 107-124 (RunSparkSubmit)
  • Helper — Spark args: esti/gc_test_utils.go lines 15-24 (getSparkSubmitArgs)
  • Helper — Docker args: esti/gc_test_utils.go lines 26-35 (getDockerArgs)

Signature

// SparkSubmitConfig holds all parameters for a Spark GC job submission.
type SparkSubmitConfig struct {
    SparkVersion    string      // Spark Docker image version tag (e.g., "3.3")
    LocalJar        string      // Path to the lakefs-spark-client JAR inside the container
    EntryPoint      string      // Main class (e.g., "io.treeverse.gc.GarbageCollection")
    ExtraSubmitArgs []string    // Additional spark-submit flags (e.g., "--conf ...")
    ProgramArgs     []string    // Arguments passed to the main class (e.g., repo, run_id)
    LogSource       string      // Logging identifier for output tagging
}

// RunSparkSubmit executes the Spark GC job in a Docker container.
func RunSparkSubmit(config SparkSubmitConfig) error

// getSparkSubmitArgs builds the spark-submit command-line arguments.
func getSparkSubmitArgs(config SparkSubmitConfig) []string

// getDockerArgs builds the docker run command-line arguments.
func getDockerArgs(config SparkSubmitConfig) []string

Import

# No import needed — this is executed via Docker + spark-submit
# The Docker image contains Spark and the lakeFS GC JAR

# Pull the image
docker pull treeverse/bitnami-spark:3.3

I/O Contract

Inputs

Parameter | Type | Required | Description
SparkVersion | string | Yes | Docker image tag for the Spark image (e.g., "3.3")
LocalJar | string | Yes | Path to the lakefs-spark-client JAR file inside the container (e.g., /opt/metaclient/client.jar)
EntryPoint | string | Yes | Fully qualified Java class name (e.g., io.treeverse.gc.GarbageCollection)
ExtraSubmitArgs | []string | No | Additional spark-submit flags (e.g., --conf spark.executor.memory=4g)
ProgramArgs | []string | Yes | Arguments to the main class: typically [repository, run_id]
LogSource | string | No | Label for log output identification
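The required/optional split in the table can be checked before launching a container. The sketch below assumes a hypothetical validate helper. Nothing in the source indicates that esti performs this check; it only illustrates the contract.

```go
package main

import (
	"fmt"
	"strings"
)

// SparkSubmitConfig mirrors the struct documented on this page.
type SparkSubmitConfig struct {
	SparkVersion    string
	LocalJar        string
	EntryPoint      string
	ExtraSubmitArgs []string
	ProgramArgs     []string
	LogSource       string
}

// validate is a hypothetical helper: it reports every field the inputs table
// marks as required that is still empty. ExtraSubmitArgs and LogSource are
// optional and therefore not checked.
func validate(cfg SparkSubmitConfig) error {
	var missing []string
	if cfg.SparkVersion == "" {
		missing = append(missing, "SparkVersion")
	}
	if cfg.LocalJar == "" {
		missing = append(missing, "LocalJar")
	}
	if cfg.EntryPoint == "" {
		missing = append(missing, "EntryPoint")
	}
	if len(cfg.ProgramArgs) == 0 {
		missing = append(missing, "ProgramArgs")
	}
	if len(missing) > 0 {
		return fmt.Errorf("missing required fields: %s", strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	// An empty config fails on all four required fields.
	fmt.Println(validate(SparkSubmitConfig{}))
}
```

Collecting all missing fields into one error, rather than failing on the first, tends to shorten the fix-and-retry loop in CI.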

Spark Configuration Properties (passed via --conf):

Property | Description
spark.hadoop.lakefs.api.url | lakeFS API endpoint (e.g., http://localhost:8000/api/v1)
spark.hadoop.lakefs.api.access_key | lakeFS access key ID
spark.hadoop.lakefs.api.secret_key | lakeFS secret access key
spark.master | Spark master URL (e.g., spark://localhost:7077 or local[*])
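One way to turn these properties into values for ExtraSubmitArgs is to expand a map into repeated --conf key=value pairs. The confFlags helper below is a hypothetical name used for illustration. Keys are sorted only to make the generated command reproducible, since Go map iteration order is not stable.

```go
package main

import (
	"fmt"
	"sort"
)

// confFlags is a hypothetical helper that expands Spark properties into
// "--conf", "key=value" pairs suitable for spark-submit.
func confFlags(props map[string]string) []string {
	keys := make([]string, 0, len(props))
	for k := range props {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order across runs

	flags := make([]string, 0, 2*len(keys))
	for _, k := range keys {
		flags = append(flags, "--conf", k+"="+props[k])
	}
	return flags
}

func main() {
	// Example values copied from the usage examples on this page.
	props := map[string]string{
		"spark.hadoop.lakefs.api.url": "http://localhost:8000/api/v1",
		"spark.master":                "local[*]",
	}
	fmt.Println(confFlags(props))
}
```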

Outputs

Output | Description
Exit code 0 | GC Spark job completed successfully; all expired objects were deleted
Exit code non-zero | Job failed; check container logs for details
Container stdout/stderr | Spark job output including deletion counts and any errors
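A caller replicating this contract in Go might stream the container output and recover the exit code using the standard os/exec package, as sketched below. runAndReport is a hypothetical helper, not part of esti, and the docker version command in main is only a placeholder for the full docker run invocation.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// runAndReport is a hypothetical helper. It runs a command, streams its
// stdout/stderr to the current process, and returns the exit code.
// It returns -1 when the process could not be started (for example, the
// binary is not on PATH) or was terminated by a signal.
func runAndReport(name string, args ...string) (int, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	err := cmd.Run()
	if err == nil {
		return 0, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// The process ran and exited with a non-zero status.
		return exitErr.ExitCode(), err
	}
	return -1, err
}

func main() {
	// Placeholder command; substitute the docker run arguments for a real job.
	code, err := runAndReport("docker", "version")
	fmt.Printf("exit code: %d, error: %v\n", code, err)
}
```

Distinguishing a non-zero exit from a failure to start is useful in pipelines, since the former points at the Spark job and the latter at the host environment.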

Usage Examples

Basic Spark GC Execution

# Run the GC Spark job after obtaining a run_id from prepareGarbageCollectionCommits
docker run --rm \
  --network host \
  treeverse/bitnami-spark:3.3 \
  spark-submit \
    --class io.treeverse.gc.GarbageCollection \
    --master spark://localhost:7077 \
    --conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
    --conf spark.hadoop.lakefs.api.access_key=AKIAIOSFODNN7EXAMPLE \
    --conf spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
    /opt/metaclient/client.jar \
    my-repo \
    gc_run_20260208_001

Local Spark Mode (Single Machine)

# For development/testing — use local Spark master
docker run --rm \
  --network host \
  treeverse/bitnami-spark:3.3 \
  spark-submit \
    --class io.treeverse.gc.GarbageCollection \
    --master "local[*]" \
    --conf spark.hadoop.lakefs.api.url=http://localhost:8000/api/v1 \
    --conf spark.hadoop.lakefs.api.access_key=AKIAIOSFODNN7EXAMPLE \
    --conf spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
    --conf spark.driver.memory=4g \
    /opt/metaclient/client.jar \
    my-repo \
    gc_run_20260208_001

AWS EMR Execution

# Submit GC job to an existing EMR cluster
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="lakeFS GC",\
ActionOnFailure=CONTINUE,\
Args=[--class,io.treeverse.gc.GarbageCollection,\
--conf,spark.hadoop.lakefs.api.url=http://lakefs-host:8000/api/v1,\
--conf,spark.hadoop.lakefs.api.access_key=AKIAIOSFODNN7EXAMPLE,\
--conf,spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY,\
s3://my-bucket/jars/lakefs-spark-client.jar,\
my-repo,gc_run_20260208_001]
