Environment:Treeverse LakeFS Spark GC Environment

Knowledge Sources	Treeverse lakeFS, lakeFS GC Documentation
Domains	Infrastructure, Data_Platform, Distributed_Computing
Last Updated	2026-02-08 10:00 GMT

Overview

Apache Spark 3.2+ environment with Java 8, Docker, and AWS credentials for running lakeFS garbage collection jobs.

Description

The lakeFS garbage collection workflow requires an external Spark job to sweep unreferenced objects from the underlying object storage. This environment provides the Spark runtime via Docker (treeverse/bitnami-spark:3.2.3), Java 8 for the Spark executor, and network access to both the lakeFS API and the object storage backend. The GC job runs as a `spark-submit` command inside a Docker container that connects to the lakeFS server over the host network.

Usage

Use this environment when executing the garbage collection workflow (WF5). It is a mandatory prerequisite for Implementation:Treeverse_LakeFS_RunSparkSubmit. The Spark job reads the GC preparation metadata from the lakeFS API and deletes unreferenced objects from the storage backend.

System Requirements

Category	Requirement	Notes
OS	Linux (Docker host)	Docker required for Spark container
Hardware	4+ CPU cores, 8GB+ RAM	Spark driver and executor memory requirements
Disk	10GB+ free space	For Spark working directory, Ivy cache, and temporary files
Network	Host networking mode	Container uses `--network host` for lakeFS and storage access
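
The requirements in the table above can be partly checked before launching the job. The following is a minimal sketch, not part of lakeFS: `checkHost` and `minCPUs` are hypothetical names, and only the CPU and Docker rows are covered (RAM and disk checks are OS-specific and are left out).

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
)

// minCPUs mirrors the "4+ CPU cores" row of the requirements table.
const minCPUs = 4

// checkHost returns one human-readable problem per unmet requirement.
// The probed values are passed in so the logic can be tested without Docker.
func checkHost(cpus int, dockerOnPath bool) []string {
	var problems []string
	if cpus < minCPUs {
		problems = append(problems, fmt.Sprintf("found %d CPU cores, want at least %d", cpus, minCPUs))
	}
	if !dockerOnPath {
		problems = append(problems, "docker executable not found on PATH")
	}
	return problems
}

func main() {
	// exec.LookPath reports whether the docker binary resolves via PATH.
	_, err := exec.LookPath("docker")
	problems := checkHost(runtime.NumCPU(), err == nil)
	for _, p := range problems {
		fmt.Println("preflight:", p)
	}
	if len(problems) == 0 {
		fmt.Println("preflight: host looks usable")
	}
}
```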

Dependencies

System Packages

`docker` (Docker Engine for running Spark container)

Container Image

`treeverse/bitnami-spark:3.2.3` (Spark 3.2.3 with Hadoop support)

Java Runtime

Java 8 (the Spark image includes its own JVM; the Hive metastore uses OpenJDK 8u242)

Spark Configuration

Master: `spark://localhost:7077`
Hadoop-lakeFS connector JAR (mounted as `/opt/metaclient/client.jar`)

Credentials

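The following credentials must be available to the Spark job. The AWS keys are passed to the Docker container as environment variables; the lakeFS keys are passed as Spark configuration properties: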

`AWS_ACCESS_KEY_ID`: AWS access key for object storage operations.
`AWS_SECRET_ACCESS_KEY`: AWS secret key for object storage operations.
Spark config `spark.hadoop.lakefs.api.access_key`: lakeFS API access key.
Spark config `spark.hadoop.lakefs.api.secret_key`: lakeFS API secret key.
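
Because `docker run -e NAME` without a value forwards the host's value, a variable that is not exported on the host ends up unset inside the container. A minimal sketch of a check that could run before the Docker command is shown below; `missingEnv` and `requiredEnv` are hypothetical names, not part of lakeFS.

```go
package main

import (
	"fmt"
	"os"
)

// requiredEnv lists the variables the Credentials section says the container needs.
var requiredEnv = []string{"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}

// missingEnv returns the names for which lookup reports no value or an empty one.
// lookup has the same shape as os.LookupEnv so it can be swapped out in tests.
func missingEnv(lookup func(string) (string, bool), names []string) []string {
	var missing []string
	for _, name := range names {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingEnv(os.LookupEnv, requiredEnv); len(missing) > 0 {
		fmt.Println("missing credentials:", missing)
		return
	}
	fmt.Println("AWS credentials present")
}
```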

Quick Install

# Pull the Spark GC image
docker pull treeverse/bitnami-spark:3.2.3

# Run GC spark-submit (example)
docker run --network host --add-host lakefs:127.0.0.1 \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
  -v /path/to/client.jar:/opt/metaclient/client.jar \
  --rm treeverse/bitnami-spark:3.2.3 spark-submit \
  --master spark://localhost:7077 \
  --conf spark.hadoop.lakefs.api.url=http://lakefs:8000/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=YOUR_KEY \
  --conf spark.hadoop.lakefs.api.secret_key=YOUR_SECRET \
  --class io.treeverse.gc.GarbageCollection \
  /opt/metaclient/client.jar
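
When the command is launched from a program rather than a shell, the same argument vector can be assembled from a small configuration value. The sketch below mirrors the shell example above under stated assumptions: `gcConfig` and `gcCommand` are hypothetical names, and the values in `main` are the same placeholders used in the shell command.

```go
package main

import (
	"fmt"
	"strings"
)

// gcConfig is a hypothetical holder for the values the shell example hard-codes.
type gcConfig struct {
	Image     string // container image, e.g. treeverse/bitnami-spark:3.2.3
	LocalJar  string // host path to the client JAR
	APIURL    string // lakeFS API endpoint as seen from the container
	AccessKey string // lakeFS API access key
	SecretKey string // lakeFS API secret key
	MainClass string // entry point passed to --class
}

// containerJar is where the JAR is mounted inside the container.
const containerJar = "/opt/metaclient/client.jar"

// gcCommand returns the arguments that follow "docker" on the command line.
func gcCommand(c gcConfig) []string {
	return []string{
		"run", "--network", "host", "--add-host", "lakefs:127.0.0.1",
		"-e", "AWS_ACCESS_KEY_ID", "-e", "AWS_SECRET_ACCESS_KEY",
		"-v", c.LocalJar + ":" + containerJar,
		"--rm", c.Image, "spark-submit",
		"--master", "spark://localhost:7077",
		"--conf", "spark.hadoop.lakefs.api.url=" + c.APIURL,
		"--conf", "spark.hadoop.lakefs.api.access_key=" + c.AccessKey,
		"--conf", "spark.hadoop.lakefs.api.secret_key=" + c.SecretKey,
		"--class", c.MainClass,
		containerJar,
	}
}

func main() {
	args := gcCommand(gcConfig{
		Image:     "treeverse/bitnami-spark:3.2.3",
		LocalJar:  "/path/to/client.jar",
		APIURL:    "http://lakefs:8000/api/v1",
		AccessKey: "YOUR_KEY",    // placeholder; avoid printing real keys
		SecretKey: "YOUR_SECRET", // placeholder; avoid printing real keys
		MainClass: "io.treeverse.gc.GarbageCollection",
	})
	fmt.Println("docker " + strings.Join(args, " "))
}
```

Passing the vector to `exec.Command("docker", args...)` avoids shell quoting issues, since each element reaches Docker as a separate argument.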

Code Evidence

Docker container arguments from `esti/gc_test_utils.go:26-35`:

func getDockerArgs(workingDirectory string, localJar string) []string {
    return []string{
        "run", "--network", "host", "--add-host", "lakefs:127.0.0.1",
        "-v", fmt.Sprintf("%s/ivy:/opt/bitnami/spark/.ivy2", workingDirectory),
        "-v", fmt.Sprintf("%s:/opt/metaclient/client.jar", localJar),
        "--rm",
        "-e", "AWS_ACCESS_KEY_ID",
        "-e", "AWS_SECRET_ACCESS_KEY",
    }
}

Spark submit arguments from `esti/gc_test_utils.go:15-24`:

func getSparkSubmitArgs(entryPoint string) []string {
    return []string{
        "--master", "spark://localhost:7077",
        "--conf", "spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp",
        "--conf", "spark.hadoop.lakefs.api.url=http://lakefs:8000" + apiutil.BaseURL,
        "--conf", "spark.hadoop.lakefs.api.access_key=AKIAIOSFDNN7EXAMPLEQ",
        "--conf", "spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "--class", entryPoint,
    }
}

RunSparkSubmit orchestration from `esti/gc_test_utils.go:107-124`:

func RunSparkSubmit(config *SparkSubmitConfig) error {
    workingDirectory, err := os.Getwd()
    // ...
    dockerArgs := getDockerArgs(workingDirectory, config.LocalJar)
    dockerArgs = append(dockerArgs, fmt.Sprintf("docker.io/treeverse/bitnami-spark:%s", config.SparkVersion), "spark-submit")
    sparkSubmitArgs := getSparkSubmitArgs(config.EntryPoint)
    // ...
    cmd := exec.Command("docker", args...)
    return runCommand(config.LogSource, cmd)
}

Common Errors

Error Message	Cause	Solution
`docker: command not found`	Docker not installed	Install Docker Engine on the host
Spark connection refused to lakefs:8000	lakeFS server not running or host networking issue	Ensure lakeFS is running and `--network host --add-host lakefs:127.0.0.1` is set
`ClassNotFoundException`	Wrong entry point class	Verify `--class` matches the GC JAR contents
AWS credentials error	`AWS_ACCESS_KEY_ID` or `AWS_SECRET_ACCESS_KEY` not set on the host	Export both AWS variables before running the docker command
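
The table above can be turned into a simple lookup that scans captured job output for a known symptom. This is an illustrative sketch only: `diagnose` and `hints` are hypothetical names, and the "connection refused" substring is an assumption about how the failure appears in the output rather than a message quoted from lakeFS or Spark.

```go
package main

import (
	"fmt"
	"strings"
)

// hint pairs a lowercase substring of the failure output with the suggested fix.
type hint struct {
	needle string
	advice string
}

// hints is checked in order; the first match wins.
var hints = []hint{
	{"docker: command not found", "Install Docker Engine on the host"},
	{"classnotfoundexception", "Verify --class matches the GC JAR contents"},
	// Assumption: a refused connection shows up as "Connection refused" in the output.
	{"connection refused", "Ensure lakeFS is running and --network host --add-host lakefs:127.0.0.1 is set"},
}

// diagnose returns the advice for the first hint found in output, or "" if none match.
func diagnose(output string) string {
	lower := strings.ToLower(output)
	for _, h := range hints {
		if strings.Contains(lower, h.needle) {
			return h.advice
		}
	}
	return ""
}

func main() {
	sample := "java.lang.ClassNotFoundException: io.treeverse.gc.GarbageCollection"
	fmt.Println(diagnose(sample))
}
```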

Compatibility Notes

Spark version: Pinned to 3.2.3 for compatibility with the lakeFS Hadoop connector. Do not upgrade without testing.
Java version: Spark 3.2 requires Java 8 or 11. The container image includes Java 8.
Host networking: Required because the Spark container needs to reach the lakeFS server on localhost.
Ivy cache: Mounted from the working directory to avoid re-downloading dependencies on each run.
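
The version constraints above can be encoded as small guards in whatever tooling launches the job. The sketch below is hypothetical (`sparkImage`, `javaSupported`, and `pinnedSparkVersion` are not lakeFS names); it builds the image reference in the same form used by `RunSparkSubmit` and flags any tag other than the pinned one.

```go
package main

import "fmt"

// pinnedSparkVersion is the image tag this page pins.
const pinnedSparkVersion = "3.2.3"

// sparkImage builds the image reference and reports whether it uses the pinned tag.
func sparkImage(version string) (image string, pinned bool) {
	return "docker.io/treeverse/bitnami-spark:" + version, version == pinnedSparkVersion
}

// javaSupported reflects the note that Spark 3.2 runs on Java 8 or 11.
func javaSupported(major int) bool {
	return major == 8 || major == 11
}

func main() {
	image, pinned := sparkImage("3.2.3")
	fmt.Println(image, pinned, javaSupported(8))
}
```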
