Implementation: Apache Hudi Run Spark Hudi Script
| Knowledge Sources | |
|---|---|
| Domains | DevOps, Development_Environment |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for launching the interactive Hudi feature-exploration environment (JupyterLab, Spark, MinIO, and Trino) provided by the Apache Hudi Docker demo.
Description
The run_spark_hudi.sh script manages the lifecycle of the Hudi notebook environment, a Docker Compose-based stack that provides JupyterLab with pre-configured Spark-Hudi integration, MinIO object storage, a Hive Metastore, and the Trino query engine. It accepts a single optional argument (start, stop, or restart, defaulting to start) that selects the lifecycle operation.
The script first detects which Docker Compose command variant is available (v1 or v2) using the get_docker_compose_cmd() function. It then delegates to Docker Compose for the requested operation. On restart, it performs a down followed by up -d --build, which forces a rebuild of any images defined in the compose file.
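The detection logic can be exercised on its own. The following sketch mirrors the function's fallback order (v2 plugin first, then the standalone v1 binary); unlike the original it prints "none found" instead of exiting, so it runs harmlessly on machines without Docker:

```shell
#!/bin/sh
# Mirrors get_docker_compose_cmd()'s fallback order: prefer the Compose v2
# plugin ("docker compose"), fall back to the standalone v1 binary
# ("docker-compose"). Prints "none found" rather than exiting, so it is
# safe to run anywhere.
if docker compose version >/dev/null 2>&1; then
  echo "docker compose"
elif docker-compose version >/dev/null 2>&1; then
  echo "docker-compose"
else
  echo "none found"
fi
```

The same pattern is useful in any wrapper script that must work on hosts with either Compose generation installed.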
The associated docker-compose.yml defines five services:
- spark-hudi -- The primary container running JupyterLab with Spark and Hudi pre-installed. Exposes ports 8888 (Jupyter), 4040 (Spark UI), 7077 (Spark Master), 8080/8081 (Spark Master/Worker UI), and 18080 (History Server).
- minio -- S3-compatible object storage with API on port 9000 and console UI on port 9001.
- mc -- MinIO Client container that initializes the warehouse bucket on startup.
- hive-metastore -- Hive Metastore service on port 9083 for table catalog management.
- trino -- Distributed SQL query engine accessible at host port 8085 (mapped from container port 8080).
All services are connected via a shared Docker network named hudi-datalake.
Usage
Use this script to:
- Launch the JupyterLab environment for interactive Hudi tutorials
- Stop the notebook environment to free resources
- Restart with a fresh state or after configuration changes
Code Reference
Source Location
- Repository: Apache Hudi
- File: hudi-notebooks/run_spark_hudi.sh - Lines: 18-52
- Compose File: hudi-notebooks/docker-compose.yml - Lines: 19-106
- Additional Reference: hudi-examples/README.md - Lines: 18-50
Script
run_spark_hudi.sh:

```shell
#!/bin/bash
state=${1:-"start"}
state=$(echo "$state" | tr '[:upper:]' '[:lower:]')

# ----------------------------------------------------------
# Function to determine which docker compose command to use
# ----------------------------------------------------------
get_docker_compose_cmd() {
  if docker compose version &>/dev/null; then
    echo "docker compose"
  elif docker-compose version &>/dev/null; then
    echo "docker-compose"
  else
    echo "ERROR: Neither 'docker compose' nor 'docker-compose' is installed or available in PATH." >&2
    exit 1
  fi
}

# Detect and assign the correct compose command
DOCKER_COMPOSE_CMD=$(get_docker_compose_cmd)

case "$state" in
  start)
    $DOCKER_COMPOSE_CMD up -d
    ;;
  stop)
    $DOCKER_COMPOSE_CMD down
    ;;
  restart)
    $DOCKER_COMPOSE_CMD down
    $DOCKER_COMPOSE_CMD up -d --build
    ;;
  *)
    echo "Usage: $0 {start|stop|restart}"
    exit 1
    ;;
esac
```
docker-compose.yml service definitions (key excerpts):

```yaml
services:
  spark-hudi:
    image: apachehudi/spark-hudi:latest
    container_name: spark-hudi
    depends_on:
      - hive-metastore
      - minio
    ports:
      - "8888:8888"   # Jupyter
      - "4040:4040"   # Spark UI
      - "7077:7077"   # Spark Master
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - AWS_ENDPOINT_URL=http://minio:9000

  minio:
    image: 'minio/minio:latest'
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API Port
      - "9001:9001"   # MinIO Console UI

  trino:
    image: apachehudi/trino:latest
    ports:
      - "8085:8080"   # Trino Web UI
```
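The mc service is omitted from the excerpt above. The fragment below is a hypothetical sketch of how such a bucket-init container is commonly wired, consistent with the behavior described earlier (create the warehouse bucket, apply a public access policy). The alias name local, the entrypoint wording, and the exact mc subcommands are assumptions, not taken from the repository's compose file:

```yaml
  # Hypothetical sketch of the init container (not the repository's actual
  # definition): creates the warehouse bucket and opens anonymous access.
  mc:
    image: minio/mc:latest
    depends_on:
      - minio
    entrypoint: >
      /bin/sh -c "
      mc alias set local http://minio:9000 admin password &&
      mc mb --ignore-existing local/warehouse &&
      mc anonymous set public local/warehouse
      "
```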
I/O Contract
Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| $1 (state) | String argument | No | One of start, stop, or restart. Case-insensitive. Defaults to start if omitted. |
| Docker Compose | System binary/plugin | Yes | Either docker compose (v2) or docker-compose (v1) must be available. Detected automatically by get_docker_compose_cmd(). |
| docker-compose.yml | YAML file | Yes | The compose file in hudi-notebooks/ defining the five-service stack. Must be in the working directory when the script runs. |
| Docker images | Docker image cache | Yes | apachehudi/spark-hudi:latest, minio/minio:latest, minio/mc:latest, apachehudi/hive:latest, apachehudi/trino:latest. |
| SPARK_MASTER | Environment variable | No | Can be set to configure the Spark master URL for examples in hudi-examples/. Defaults to yarn-cluster mode. |
Outputs
| Name | Type | Description |
|---|---|---|
| JupyterLab at localhost:8888 | Web service | Interactive notebook environment with 5 pre-built Hudi tutorial notebooks available at /opt/workspace/notebooks/. |
| Spark UI at localhost:4040 | Web service | Spark application monitoring dashboard showing jobs, stages, storage, and executors. |
| MinIO at localhost:9000/9001 | Web service | S3-compatible API (9000) and web console (9001) for browsing stored Hudi table data. |
| Trino at localhost:8085 | Web service | Trino coordinator web UI for monitoring SQL queries executed against Hudi tables. |
| warehouse bucket | MinIO storage | Pre-created S3 bucket with public access policy, initialized by the mc container. |
| hudi-datalake network | Docker network | Shared network connecting all services for inter-container communication. |
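Once the stack is up, the endpoints above can be smoke-tested from the shell. A minimal sketch, assuming nc (netcat) is installed; wait_for_port is a helper invented here for illustration, not part of the repository:

```shell
#!/bin/sh
# wait_for_port is a hypothetical helper (not part of the Hudi repository):
# reports whether a TCP port accepts connections, retrying up to $3 times
# with a one-second pause between attempts.
wait_for_port() {
  host=$1; port=$2; tries=${3:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if nc -z "$host" "$port" 2>/dev/null; then
      echo "$host:$port up"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "$host:$port down"
  return 1
}

# Check the main service ports from the Outputs table.
for p in 8888 9000 9001 8085; do
  wait_for_port localhost "$p" 1 || true
done
```

A loop like this is handy right after ./run_spark_hudi.sh start, since the containers take a few seconds to become reachable.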
Usage Examples
```shell
# Navigate to the hudi-notebooks directory
cd hudi-notebooks/

# Start the notebook environment
./run_spark_hudi.sh start

# The argument is case-insensitive, so this is equivalent
./run_spark_hudi.sh START

# Open JupyterLab in a browser:  http://localhost:8888
# Open the MinIO Console:        http://localhost:9001  (login: admin / password)
# Open the Trino Web UI:         http://localhost:8085

# Check running containers
docker ps --filter "network=hudi-datalake"

# Stop the environment
./run_spark_hudi.sh stop

# Restart with rebuild (useful after image changes)
./run_spark_hudi.sh restart

# For Hudi examples using spark-submit, point at the in-container Spark master:
export SPARK_MASTER=spark://sparkmaster:7077
# Then run example scripts from hudi-examples/
```
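The SPARK_MASTER fallback described in the I/O contract is the standard shell default-expansion pattern. A minimal sketch (the yarn-cluster default is taken from the table above; the echo line is illustrative):

```shell
#!/bin/sh
# Use SPARK_MASTER when set, otherwise fall back to the yarn-cluster
# default described in the I/O contract.
SPARK_MASTER=${SPARK_MASTER:-"yarn-cluster"}
echo "Using Spark master: $SPARK_MASTER"
```

Exporting SPARK_MASTER=spark://sparkmaster:7077 before running the examples therefore overrides the default without editing any scripts.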