Environment:Treeverse LakeFS Spark GC Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Data_Platform, Distributed_Computing
Last Updated: 2026-02-08 10:00 GMT

Overview

Apache Spark 3.2 environment (container image pinned to 3.2.3) with Java 8, Docker, and AWS credentials for running lakeFS garbage collection jobs.

Description

The lakeFS garbage collection workflow requires an external Spark job to sweep unreferenced objects from the underlying object storage. This environment provides the Spark runtime via Docker (treeverse/bitnami-spark:3.2.3), Java 8 for the Spark executor, and network access to both the lakeFS API and the object storage backend. The GC job runs as a `spark-submit` command inside a Docker container that connects to the lakeFS server over the host network.

Usage

Use this environment when executing the garbage collection workflow (WF5). It is a mandatory prerequisite for Implementation:Treeverse_LakeFS_RunSparkSubmit. The Spark job reads the GC preparation metadata from the lakeFS API and deletes unreferenced objects from the storage backend.

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux (Docker host) | Docker required for Spark container |
| Hardware | 4+ CPU cores, 8GB+ RAM | Spark driver and executor memory requirements |
| Disk | 10GB+ free space | For Spark working directory, Ivy cache, and temporary files |
| Network | Host networking mode | Container uses `--network host` for lakeFS and storage access |

Dependencies

System Packages

  • `docker` (Docker Engine for running Spark container)

Container Image

  • `treeverse/bitnami-spark:3.2.3` (Spark 3.2.3 with Hadoop support)

Java Runtime

  • Java 8 (OpenJDK 8u242 in Hive metastore; Spark image includes its own JVM)

Spark Configuration

  • Master: `spark://localhost:7077`
  • Hadoop-lakeFS connector JAR (mounted as `/opt/metaclient/client.jar`)

Credentials

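The following credentials must be available to the Docker container. The AWS keys are passed as environment variables; the lakeFS keys are passed as Spark configuration properties: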

  • `AWS_ACCESS_KEY_ID`: AWS access key for object storage operations.
  • `AWS_SECRET_ACCESS_KEY`: AWS secret key for object storage operations.
  • Spark config `spark.hadoop.lakefs.api.access_key`: lakeFS API access key.
  • Spark config `spark.hadoop.lakefs.api.secret_key`: lakeFS API secret key.
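Because `docker run -e NAME` forwards a variable only when it is set on the host, it can help to verify the two AWS variables before launching the container. The following is a minimal sketch of such a check, not part of the lakeFS code base; the names `requiredEnv` and `missingEnv` are hypothetical and introduced here for illustration only.

```go
package main

import (
	"fmt"
	"os"
)

// requiredEnv lists the environment variables this page says the
// container needs. (Hypothetical helper name, not from lakeFS.)
var requiredEnv = []string{"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}

// missingEnv returns every name in required for which lookup reports
// either "not set" or an empty value. The lookup function is injected
// so the check can be exercised without touching the real environment.
func missingEnv(required []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range required {
		if value, ok := lookup(name); !ok || value == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	missing := missingEnv(requiredEnv, os.LookupEnv)
	if len(missing) > 0 {
		fmt.Printf("missing credentials, export them first: %v\n", missing)
		return
	}
	fmt.Println("AWS credentials are set and can be forwarded with -e")
}
```

The lakeFS access key and secret key are not covered by this check because they are supplied through `--conf` arguments rather than environment variables.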

Quick Install

# Pull the Spark GC image
docker pull treeverse/bitnami-spark:3.2.3

# Run GC spark-submit (example)
docker run --network host --add-host lakefs:127.0.0.1 \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
  -v /path/to/client.jar:/opt/metaclient/client.jar \
  --rm treeverse/bitnami-spark:3.2.3 spark-submit \
  --master spark://localhost:7077 \
  --conf spark.hadoop.lakefs.api.url=http://lakefs:8000/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=YOUR_KEY \
  --conf spark.hadoop.lakefs.api.secret_key=YOUR_SECRET \
  --class io.treeverse.gc.GarbageCollection \
  /opt/metaclient/client.jar

Code Evidence

Docker container arguments from `esti/gc_test_utils.go:26-35`:

func getDockerArgs(workingDirectory string, localJar string) []string {
    return []string{
        "run", "--network", "host", "--add-host", "lakefs:127.0.0.1",
        "-v", fmt.Sprintf("%s/ivy:/opt/bitnami/spark/.ivy2", workingDirectory),
        "-v", fmt.Sprintf("%s:/opt/metaclient/client.jar", localJar),
        "--rm",
        "-e", "AWS_ACCESS_KEY_ID",
        "-e", "AWS_SECRET_ACCESS_KEY",
    }
}

Spark submit arguments from `esti/gc_test_utils.go:15-24`:

func getSparkSubmitArgs(entryPoint string) []string {
    return []string{
        "--master", "spark://localhost:7077",
        "--conf", "spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp",
        "--conf", "spark.hadoop.lakefs.api.url=http://lakefs:8000" + apiutil.BaseURL,
        "--conf", "spark.hadoop.lakefs.api.access_key=AKIAIOSFDNN7EXAMPLEQ",
        "--conf", "spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "--class", entryPoint,
    }
}

RunSparkSubmit orchestration from `esti/gc_test_utils.go:107-124`:

func RunSparkSubmit(config *SparkSubmitConfig) error {
    workingDirectory, err := os.Getwd()
    // ...
    dockerArgs := getDockerArgs(workingDirectory, config.LocalJar)
    dockerArgs = append(dockerArgs, fmt.Sprintf("docker.io/treeverse/bitnami-spark:%s", config.SparkVersion), "spark-submit")
    sparkSubmitArgs := getSparkSubmitArgs(config.EntryPoint)
    // ...
    cmd := exec.Command("docker", args...)
    return runCommand(config.LogSource, cmd)
}
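The excerpt above elides how the final `args` slice is built. As an illustration only, the sketch below shows one plausible way to order the pieces for `docker run`: Docker flags first, then the image reference, then the command to run inside the container and its arguments. The function `composeArgs` and its parameters are assumptions made for this example and do not reproduce the elided lines of `esti/gc_test_utils.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// composeArgs is a hypothetical helper. It concatenates the argument
// groups in the order "docker run" expects: docker flags, image,
// in-container command, that command's flags, and finally the JAR path
// as seen inside the container.
func composeArgs(dockerArgs []string, image string, sparkArgs []string, jarInContainer string) []string {
	// Allocate a fresh slice so the callers' slices are never mutated.
	args := make([]string, 0, len(dockerArgs)+len(sparkArgs)+3)
	args = append(args, dockerArgs...)
	args = append(args, image, "spark-submit")
	args = append(args, sparkArgs...)
	args = append(args, jarInContainer)
	return args
}

func main() {
	dockerArgs := []string{"run", "--network", "host", "--add-host", "lakefs:127.0.0.1", "--rm"}
	sparkArgs := []string{"--master", "spark://localhost:7077", "--class", "io.treeverse.gc.GarbageCollection"}
	args := composeArgs(dockerArgs, "docker.io/treeverse/bitnami-spark:3.2.3", sparkArgs, "/opt/metaclient/client.jar")

	// Print the command instead of executing it, so it can be inspected.
	fmt.Println("docker " + strings.Join(args, " "))
}
```

Printing the command rather than running it makes it easy to compare against the Quick Install example before handing the slice to `exec.Command`.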

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `docker: command not found` | Docker not installed | Install Docker Engine on the host |
| Connection refused to `lakefs:8000` | lakeFS server not running or host networking issue | Ensure lakeFS is running and `--network host --add-host lakefs:127.0.0.1` is set |
| `ClassNotFoundException` | Wrong entry point class | Verify `--class` matches the GC JAR contents |
| AWS credentials error | `AWS_ACCESS_KEY_ID` or `AWS_SECRET_ACCESS_KEY` not set | Export AWS credentials before running the docker command |
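Two of the errors above (Docker missing, lakeFS unreachable) can be detected before the job is submitted. The sketch below is one way to do that using only the Go standard library; the `check` type, the sentinel errors, and the address `localhost:8000` (taken from the API URL used on this page) are assumptions for illustration, not part of lakeFS.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os/exec"
	"time"
)

// Sentinel errors carrying the suggested fix. (Hypothetical names.)
var (
	errDockerMissing = errors.New("docker not found in PATH; install Docker Engine on the host")
	errLakeFSDown    = errors.New("cannot connect to localhost:8000; ensure lakeFS is running")
)

// check pairs a label with a function that returns nil on success.
type check struct {
	name string
	run  func() error
}

// checkDocker verifies that a docker executable is on the PATH.
func checkDocker() error {
	if _, err := exec.LookPath("docker"); err != nil {
		return errDockerMissing
	}
	return nil
}

// checkLakeFS attempts a TCP connection to the assumed lakeFS address.
func checkLakeFS() error {
	conn, err := net.DialTimeout("tcp", "localhost:8000", 2*time.Second)
	if err != nil {
		return errLakeFSDown
	}
	_ = conn.Close()
	return nil
}

// failures runs every check and collects one message per failed check.
func failures(checks []check) []string {
	var out []string
	for _, c := range checks {
		if err := c.run(); err != nil {
			out = append(out, fmt.Sprintf("%s: %v", c.name, err))
		}
	}
	return out
}

func main() {
	problems := failures([]check{
		{name: "docker", run: checkDocker},
		{name: "lakefs", run: checkLakeFS},
	})
	for _, p := range problems {
		fmt.Println("FAIL", p)
	}
	if len(problems) == 0 {
		fmt.Println("preflight ok")
	}
}
```

A successful TCP connection only shows that something is listening on the port; it does not confirm that the lakeFS API credentials are valid.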

Compatibility Notes

  • Spark version: Pinned to 3.2.3 for compatibility with the lakeFS Hadoop connector. Do not upgrade without testing.
  • Java version: Spark 3.2 requires Java 8 or 11. The container image includes Java 8.
  • Host networking: Required because the Spark container reaches the lakeFS server on the host's localhost; `--add-host lakefs:127.0.0.1` maps the `lakefs` hostname to that address.
  • Ivy cache: Mounted from the working directory to avoid re-downloading dependencies on each run.
