Environment:Treeverse LakeFS Spark GC Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Data_Platform, Distributed_Computing |
| Last Updated | 2026-02-08 10:00 GMT |
Overview
Apache Spark 3.2+ environment with Java 8, Docker, and AWS credentials for running lakeFS garbage collection jobs.
Description
The lakeFS garbage collection workflow requires an external Spark job to sweep unreferenced objects from the underlying object storage. This environment provides the Spark runtime via Docker (treeverse/bitnami-spark:3.2.3), Java 8 for the Spark executor, and network access to both the lakeFS API and the object storage backend. The GC job runs as a `spark-submit` command inside a Docker container that connects to the lakeFS server over the host network.
Usage
Use this environment when executing the garbage collection workflow (WF5). It is a mandatory prerequisite for Implementation:Treeverse_LakeFS_RunSparkSubmit. The Spark job reads the GC preparation metadata from the lakeFS API and deletes unreferenced objects from the storage backend.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Docker host) | Docker required for Spark container |
| Hardware | 4+ CPU cores, 8GB+ RAM | Spark driver and executor memory requirements |
| Disk | 10GB+ free space | For Spark working directory, Ivy cache, and temporary files |
| Network | Host networking mode | Container uses `--network host` for lakeFS and storage access |
Dependencies
System Packages
- `docker` (Docker Engine for running Spark container)
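A quick way to confirm this prerequisite before launching a GC run is to look the binary up on `PATH`. The following is a minimal sketch, not part of the lakeFS code base; the helper name `findTool` is hypothetical and introduced here only for illustration.

```go
package main

import (
	"fmt"
	"os/exec"
)

// findTool (hypothetical helper) returns the resolved path of an executable
// found on PATH, or an error that names the missing tool.
func findTool(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("%s not found on PATH: %w", name, err)
	}
	return path, nil
}

func main() {
	path, err := findTool("docker")
	if err != nil {
		// This is the situation that would otherwise surface as
		// "docker: command not found" when the GC job is launched.
		fmt.Println("preflight failed:", err)
		return
	}
	fmt.Println("docker found at", path)
}
```

The check only proves that a `docker` executable exists; it does not verify that the Docker daemon is running or that the current user may talk to it.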
Container Image
- `treeverse/bitnami-spark:3.2.3` (Spark 3.2.3 with Hadoop support)
Java Runtime
- Java 8 (the Spark image bundles its own JVM; the Hive metastore uses OpenJDK 8u242)
Spark Configuration
- Master: `spark://localhost:7077`
- Hadoop-lakeFS connector JAR (mounted as `/opt/metaclient/client.jar`)
Credentials
The following environment variables must be available to the Docker container:
- `AWS_ACCESS_KEY_ID`: AWS access key for object storage operations.
- `AWS_SECRET_ACCESS_KEY`: AWS secret key for object storage operations.
- Spark config `spark.hadoop.lakefs.api.access_key`: lakeFS API access key.
- Spark config `spark.hadoop.lakefs.api.secret_key`: lakeFS API secret key.
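Because `docker run -e NAME` (with no `=value`) forwards the variable from the host environment, a variable that is not exported on the host is simply absent inside the container. The sketch below shows one way to check for that before starting the job. The names `requiredEnv` and `missingEnv` are hypothetical and do not come from the lakeFS sources.

```go
package main

import (
	"fmt"
	"os"
)

// requiredEnv lists the variables the Credentials section says must be
// available to the Docker container.
var requiredEnv = []string{"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}

// missingEnv returns the names in required that are unset or empty.
// The lookup function is a parameter so the check can be exercised
// without modifying the real process environment.
func missingEnv(required []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range required {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingEnv(requiredEnv, os.LookupEnv); len(missing) > 0 {
		fmt.Println("missing credentials:", missing)
		return
	}
	fmt.Println("AWS credentials are present in the environment")
}
```

The lakeFS API keys are not covered by this check because they are passed as Spark `--conf` values rather than as environment variables.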
Quick Install
# Pull the Spark GC image
docker pull treeverse/bitnami-spark:3.2.3

# Run GC spark-submit (example)
docker run --network host --add-host lakefs:127.0.0.1 \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
  -v /path/to/client.jar:/opt/metaclient/client.jar \
  --rm treeverse/bitnami-spark:3.2.3 spark-submit \
  --master spark://localhost:7077 \
  --conf spark.hadoop.lakefs.api.url=http://lakefs:8000/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=YOUR_KEY \
  --conf spark.hadoop.lakefs.api.secret_key=YOUR_SECRET \
  --class io.treeverse.gc.GarbageCollection \
  /opt/metaclient/client.jar
Code Evidence
Docker container arguments from `esti/gc_test_utils.go:26-35`:
func getDockerArgs(workingDirectory string, localJar string) []string {
	return []string{
		"run", "--network", "host", "--add-host", "lakefs:127.0.0.1",
		"-v", fmt.Sprintf("%s/ivy:/opt/bitnami/spark/.ivy2", workingDirectory),
		"-v", fmt.Sprintf("%s:/opt/metaclient/client.jar", localJar),
		"--rm",
		"-e", "AWS_ACCESS_KEY_ID",
		"-e", "AWS_SECRET_ACCESS_KEY",
	}
}
Spark submit arguments from `esti/gc_test_utils.go:15-24`:
func getSparkSubmitArgs(entryPoint string) []string {
	return []string{
		"--master", "spark://localhost:7077",
		"--conf", "spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp",
		"--conf", "spark.hadoop.lakefs.api.url=http://lakefs:8000" + apiutil.BaseURL,
		"--conf", "spark.hadoop.lakefs.api.access_key=AKIAIOSFDNN7EXAMPLEQ",
		"--conf", "spark.hadoop.lakefs.api.secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
		"--class", entryPoint,
	}
}
RunSparkSubmit orchestration from `esti/gc_test_utils.go:107-124`:
func RunSparkSubmit(config *SparkSubmitConfig) error {
	workingDirectory, err := os.Getwd()
	// ...
	dockerArgs := getDockerArgs(workingDirectory, config.LocalJar)
	dockerArgs = append(dockerArgs, fmt.Sprintf("docker.io/treeverse/bitnami-spark:%s", config.SparkVersion), "spark-submit")
	sparkSubmitArgs := getSparkSubmitArgs(config.EntryPoint)
	// ...
	cmd := exec.Command("docker", args...)
	return runCommand(config.LogSource, cmd)
}
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `docker: command not found` | Docker not installed | Install Docker Engine on the host |
| Spark connection refused to lakefs:8000 | lakeFS server not running or host networking issue | Ensure lakeFS is running and `--network host --add-host lakefs:127.0.0.1` is set |
| `ClassNotFoundException` | Wrong entry point class | Verify `--class` matches the GC JAR contents |
| AWS credentials error | `AWS_ACCESS_KEY_ID` or `AWS_SECRET_ACCESS_KEY` is not set on the host | Export both variables before running the `docker` command |
Compatibility Notes
- Spark version: Pinned to 3.2.3 for compatibility with the lakeFS Hadoop connector. Do not upgrade without testing.
- Java version: Spark 3.2 requires Java 8 or 11. The container image includes Java 8.
- Host networking: Required because the Spark container needs to reach the lakeFS server on localhost.
- Ivy cache: Mounted from the working directory to avoid re-downloading dependencies on each run.