Implementation:Apache Hudi Run Spark Hudi Script

From Leeroopedia


Knowledge Sources
Domains DevOps, Development_Environment
Last Updated 2026-02-08 00:00 GMT

Overview

A concrete tool for launching the interactive Hudi feature-exploration environment, with JupyterLab, Spark, MinIO, Hive Metastore, and Trino, provided by the Apache Hudi Docker demo.

Description

The run_spark_hudi.sh script manages the lifecycle of the Hudi notebook environment, a Docker Compose-based stack that provides JupyterLab with pre-configured Spark-Hudi integration, MinIO object storage, Hive Metastore, and Trino query engine. It accepts a single argument (start, stop, or restart) to control the environment lifecycle.
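The argument handling can be mirrored in isolation. This is a minimal sketch; the `normalize_state` helper name is hypothetical and not part of the script:

```shell
# Hypothetical helper mirroring the script's argument handling:
# default to "start" when no argument is given, then lowercase
# the value so START and start behave the same.
normalize_state() {
  state=${1:-start}
  echo "$state" | tr '[:upper:]' '[:lower:]'
}
```

Normalizing once up front keeps the `case` statement below it simple, since only lowercase patterns need to be matched.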

The script first detects which Docker Compose command variant is available (v1 or v2) using the get_docker_compose_cmd() function. It then delegates to Docker Compose for the requested operation. On restart, it performs a down followed by up -d --build, which forces a rebuild of any images defined in the compose file.
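The detection logic generalizes to "use the first candidate that works". A simplified sketch of that pattern follows; the `first_available` helper is illustrative and only checks PATH presence, whereas the real script runs each command's `version` subcommand to confirm the compose plugin actually responds:

```shell
# Simplified sketch: return the first command found on PATH.
# Unlike the script, this does not run "<cmd> version", so it
# cannot distinguish a broken installation from a working one.
first_available() {
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      return 0
    fi
  done
  return 1
}
```

A caller would use it the same way the script uses its detector, e.g. `COMPOSE_CMD=$(first_available "docker compose" docker-compose)`.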

The associated docker-compose.yml defines five services:

  • spark-hudi -- The primary container running JupyterLab with Spark and Hudi pre-installed. Exposes ports 8888 (Jupyter), 4040 (Spark UI), 7077 (Spark Master), 8080/8081 (Spark Master/Worker UI), and 18080 (History Server).
  • minio -- S3-compatible object storage with API on port 9000 and console UI on port 9001.
  • mc -- MinIO Client container that initializes the warehouse bucket on startup.
  • hive-metastore -- Hive Metastore service on port 9083 for table catalog management.
  • trino -- Distributed SQL query engine accessible at port 8085 (mapped from container 8080).

All services are connected via a shared Docker network named hudi-datalake.

Usage

Use this script to:

  • Launch the JupyterLab environment for interactive Hudi tutorials
  • Stop the notebook environment to free resources
  • Restart with a fresh state or after configuration changes

Code Reference

Source Location

  • Repository: Apache Hudi
  • File: hudi-notebooks/run_spark_hudi.sh
  • Lines: 18-52
  • Compose File: hudi-notebooks/docker-compose.yml
  • Lines: 19-106
  • Additional Reference: hudi-examples/README.md
  • Lines: 18-50

Script

run_spark_hudi.sh:

#!/bin/bash
state=${1:-"start"}
state=$(echo "$state" | tr '[:upper:]' '[:lower:]')

# ----------------------------------------------------------
# Function to determine which docker compose command to use
# ----------------------------------------------------------
get_docker_compose_cmd() {
    if docker compose version &>/dev/null; then
        echo "docker compose"
    elif docker-compose version &>/dev/null; then
        echo "docker-compose"
    else
        echo "ERROR: Neither 'docker compose' nor 'docker-compose' is installed or available in PATH." >&2
        exit 1
    fi
}

# Detect and assign the correct compose command
DOCKER_COMPOSE_CMD=$(get_docker_compose_cmd)

case "$state" in
  start)
    $DOCKER_COMPOSE_CMD up -d
    ;;
  stop)
    $DOCKER_COMPOSE_CMD down
    ;;
  restart)
    $DOCKER_COMPOSE_CMD down
    $DOCKER_COMPOSE_CMD up -d --build
    ;;
  *)
    echo "Usage: $0 {start|stop|restart}"
    exit 1
    ;;
esac

docker-compose.yml service definitions (key excerpts):

services:
  spark-hudi:
    image: apachehudi/spark-hudi:latest
    container_name: spark-hudi
    depends_on:
      - hive-metastore
      - minio
    ports:
      - "8888:8888"   # Jupyter
      - "4040:4040"   # Spark UI
      - "7077:7077"   # Spark Master
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - AWS_ENDPOINT_URL=http://minio:9000

  minio:
    image: 'minio/minio:latest'
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API Port
      - "9001:9001"   # MinIO Console UI

  trino:
    image: apachehudi/trino:latest
    ports:
      - "8085:8080"   # Trino Web UI
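The excerpt above omits the network stanzas. A shared network such as hudi-datalake is typically declared once at the top level and referenced from each service; the following is a sketch of that shape only, where the driver choice is an assumption and the exact stanza in the repository may differ:

```yaml
# Sketch only: the repository's actual network definition may differ.
networks:
  hudi-datalake:
    driver: bridge   # assumption: the default bridge driver

services:
  spark-hudi:
    networks:
      - hudi-datalake
```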

I/O Contract

Inputs

  • $1 (state) -- string argument, optional. One of start, stop, or restart (case-insensitive); defaults to start if omitted.
  • Docker Compose -- system binary/plugin, required. Either docker compose (v2) or docker-compose (v1) must be available in PATH; detected automatically by get_docker_compose_cmd().
  • docker-compose.yml -- YAML file, required. The compose file in hudi-notebooks/ defining the five-service stack; must be in the working directory when the script runs.
  • Docker images -- image cache, required. apachehudi/spark-hudi:latest, minio/minio:latest, minio/mc:latest, apachehudi/hive:latest, and apachehudi/trino:latest.
  • SPARK_MASTER -- environment variable, optional. Sets the Spark master URL for examples in hudi-examples/; defaults to yarn-cluster mode.

Outputs

  • JupyterLab at localhost:8888 -- web service. Interactive notebook environment with five pre-built Hudi tutorial notebooks under /opt/workspace/notebooks/.
  • Spark UI at localhost:4040 -- web service. Spark application monitoring dashboard showing jobs, stages, storage, and executors.
  • MinIO at localhost:9000/9001 -- web service. S3-compatible API (9000) and web console (9001) for browsing stored Hudi table data.
  • Trino at localhost:8085 -- web service. Trino coordinator web UI for monitoring SQL queries executed against Hudi tables.
  • warehouse bucket -- MinIO storage. Pre-created S3 bucket with a public access policy, initialized by the mc container.
  • hudi-datalake network -- Docker network. Shared network connecting all services for inter-container communication.

Usage Examples

# Navigate to the hudi-notebooks directory
cd hudi-notebooks/

# Start the notebook environment
./run_spark_hudi.sh start

# The argument is case-insensitive, so this is equivalent
./run_spark_hudi.sh START

# Open JupyterLab in a browser
# http://localhost:8888

# Open MinIO Console
# http://localhost:9001 (login: admin / password)

# Open Trino Web UI
# http://localhost:8085

# Check running containers
docker ps --filter "network=hudi-datalake"

# Stop the environment
./run_spark_hudi.sh stop

# Restart with rebuild (useful after image changes)
./run_spark_hudi.sh restart

# For Hudi examples using Spark submit:
export SPARK_MASTER=spark://sparkmaster:7077
# Then run example scripts from hudi-examples/
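Note that `start` returns as soon as the containers are launched, before JupyterLab is necessarily ready to serve requests; a small readiness loop can close that gap. This is a sketch assuming curl is installed, and `wait_for_http` is a hypothetical helper, not part of the repository:

```shell
# Poll a URL until it answers or the attempt budget runs out.
# Returns 0 once the endpoint responds, 1 after exhausting tries.
wait_for_http() {
  url=$1
  tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  return 1
}

# Example: wait_for_http http://localhost:8888 30 && echo "Jupyter is up"
```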
