
Principle:Apache Hudi Demo Environment Startup

From Leeroopedia


Knowledge Sources
Domains: DevOps, Development_Environment
Last Updated: 2026-02-08 00:00 GMT

Overview

Orchestrating the startup of a complete multi-service Hadoop/Spark/Hive cluster within Docker containers to provide a fully functional Apache Hudi demonstration environment.

Description

The Demo Environment Startup principle encompasses the end-to-end process of bringing up the 13-service Docker cluster that forms the Hudi demo environment. This is not a simple single-container launch; it involves coordinated startup of interdependent services spanning HDFS storage, Hive metadata management, Spark compute, message queuing (Kafka/Zookeeper), and object storage (MinIO).

The startup process follows a specific sequence:

Phase 1: Teardown of previous state. Any previously running demo containers are stopped and removed via docker compose down. This ensures a clean starting state and prevents port conflicts or stale container state from interfering.

Phase 2: Image acquisition. In the default (non-dev) mode, images are pulled from Docker Hub. In dev mode, this step is skipped and locally-built images are used instead. This dual-mode design supports both end-users who want a quick demo and developers who are testing code changes.

Phase 3: Container orchestration. Docker Compose reads the architecture-appropriate YAML configuration file and starts all 13 services in detached mode. The compose file defines inter-service dependencies, port mappings, volume mounts, environment variables, and health checks that govern startup ordering.

Phase 4: In-container setup. After the containers are running (with a 15-second stabilization delay), the setup_demo_container.sh script is executed inside the adhoc-1 and adhoc-2 containers. This script copies Spark configuration files, creates HDFS directories (/var/demo/ and /tmp/spark-events), uploads demo configuration to HDFS, and sets executable permissions on the Hive sync tool.
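The four phases above can be sketched as a shell dry run. The compose file name, container names, and setup script path are taken from this page; the DEV_MODE flag and the command-list structure are illustrative assumptions, not the actual demo script, which executes these steps directly.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the demo startup sequence (prints commands, does not run them).
set -eu
COMPOSE_FILE="docker-compose_hadoop334_hive313_spark353_amd64.yml"
DEV_MODE="${DEV_MODE:-false}"   # hypothetical flag standing in for dev mode

CMDS=()
CMDS+=("docker compose -f $COMPOSE_FILE down")     # Phase 1: teardown of previous state
if [ "$DEV_MODE" != "true" ]; then
  CMDS+=("docker compose -f $COMPOSE_FILE pull")   # Phase 2: image acquisition (skipped in dev mode)
fi
CMDS+=("docker compose -f $COMPOSE_FILE up -d")    # Phase 3: start all services detached
CMDS+=("sleep 15")                                 # stabilization delay before setup
for c in adhoc-1 adhoc-2; do                       # Phase 4: in-container setup
  CMDS+=("docker exec $c /bin/bash /var/hoodie/ws/docker/setup_demo_container.sh")
done

printf '%s\n' "${CMDS[@]}"
```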

The 13 Services:

Service | Image | Primary Role | Key Port(s)
namenode | hudi-hadoop_3.3.4-namenode | HDFS metadata management | 9870 (Web UI), 8020 (IPC)
datanode1 | hudi-hadoop_3.3.4-datanode | HDFS block storage | 9864 (Web UI)
historyserver | hudi-hadoop_3.3.4-history | MapReduce job history | 8188, 19888
hive-metastore-postgresql | bde2020/hive-metastore-postgresql | Hive metastore backing DB | 5432
hivemetastore | hudi-hadoop_3.3.4-hive_3.1.3 | Hive Metastore service | 9083
hiveserver | hudi-hadoop_3.3.4-hive_3.1.3 | HiveServer2 (SQL interface) | 10000, 10002
zookeeper | bitnami/zookeeper | Distributed coordination | 2181
kafkabroker | apache/kafka | Message streaming | 9092
sparkmaster | hudi-hadoop_3.3.4-hive_3.1.3-sparkmaster_3.5.3 | Spark cluster manager | 8080, 7077, 8888
spark-worker-1 | hudi-hadoop_3.3.4-hive_3.1.3-sparkworker_3.5.3 | Spark task execution | 8081
adhoc-1 | hudi-hadoop_3.3.4-hive_3.1.3-sparkadhoc_3.5.3 | Ad-hoc Spark queries | 4040
adhoc-2 | hudi-hadoop_3.3.4-hive_3.1.3-sparkadhoc_3.5.3 | Ad-hoc Spark queries | 5005
minio + mc | minio/minio, minio/mc | S3-compatible object storage | 9090, 9091

Usage

Apply this principle:

  • When starting the Hudi demo for the first time on a new machine
  • When restarting the demo after making configuration changes
  • When switching between dev mode (local images) and default mode (Docker Hub images)
  • When debugging service startup failures by understanding the expected startup sequence
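For the last case, a small helper (a sketch, not part of the demo tooling) can report which of the expected services from the table above are not currently running according to `docker compose ps`:

```shell
#!/usr/bin/env bash
# Report expected demo services that `docker compose ps` does not list as running.
# The SERVICES list comes from the table above; "minio" stands in for minio + mc.
SERVICES="namenode datanode1 historyserver hive-metastore-postgresql \
hivemetastore hiveserver zookeeper kafkabroker sparkmaster \
spark-worker-1 adhoc-1 adhoc-2 minio"

# Empty if docker/compose is unavailable or nothing is running.
running="$(docker compose ps --services --status running 2>/dev/null || true)"

missing=""
for s in $SERVICES; do
  echo "$running" | grep -qx "$s" || missing="$missing $s"
done
echo "missing:$missing"
```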

Theoretical Basis

Docker Compose Orchestration:

Docker Compose manages multi-container applications through a declarative YAML specification. The Hudi demo compose file defines service dependencies using depends_on and links directives, which control startup ordering. For example, datanode1 depends on namenode, ensuring the HDFS NameNode is running before DataNodes attempt to register. Similarly, hiveserver depends on hivemetastore, which in turn depends on hive-metastore-postgresql.
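A minimal compose fragment illustrating the dependency chain described above. The service names match the table; the image names and tags shown here are illustrative assumptions, and the real compose file carries many more settings per service.

```yaml
# Illustrative fragment only -- not the demo's actual compose file.
services:
  namenode:
    image: apachehudi/hudi-hadoop_3.3.4-namenode:latest
  datanode1:
    image: apachehudi/hudi-hadoop_3.3.4-datanode:latest
    depends_on:
      - namenode              # NameNode must be up before DataNodes register
  hive-metastore-postgresql:
    image: bde2020/hive-metastore-postgresql
  hivemetastore:
    image: apachehudi/hudi-hadoop_3.3.4-hive_3.1.3:latest
    depends_on:
      - hive-metastore-postgresql
  hiveserver:
    image: apachehudi/hudi-hadoop_3.3.4-hive_3.1.3:latest
    depends_on:
      - hivemetastore         # HiveServer2 needs the metastore, which needs its DB
```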

Architecture-Aware Configuration:

The startup script selects between two compose files based on the host architecture:

COMPOSE_FILE_NAME="docker-compose_hadoop334_hive313_spark353_amd64.yml"
if [ "$(uname -m)" = "arm64" ]; then
  COMPOSE_FILE_NAME="docker-compose_hadoop334_hive313_spark353_arm64.yml"
fi

This ensures that the correct platform-specific images are used without requiring manual configuration. The amd64 and arm64 compose files differ primarily in their platform: directives and potentially in image tags.
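One caveat: on Linux ARM hosts, `uname -m` typically reports aarch64 rather than arm64 (the macOS value), so a variant of the selection that matches both might look like this sketch:

```shell
#!/usr/bin/env bash
# Sketch: select the arm64 compose file on both macOS (arm64) and Linux (aarch64) ARM hosts.
COMPOSE_FILE_NAME="docker-compose_hadoop334_hive313_spark353_amd64.yml"
case "$(uname -m)" in
  arm64|aarch64)
    COMPOSE_FILE_NAME="docker-compose_hadoop334_hive313_spark353_arm64.yml"
    ;;
esac
echo "$COMPOSE_FILE_NAME"
```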

Volume Mounting for Development:

The compose configuration mounts the Hudi workspace (${HUDI_WS}) into several containers at /var/hoodie/ws. This bind mount allows containers to access the latest Hudi source code and built artifacts directly from the host filesystem, enabling a rapid edit-test cycle without rebuilding images.
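As a sketch, the bind mount would appear in the compose file roughly as below; which services receive the mount is defined by the demo's actual compose file, and adhoc-1 is shown here only as an example.

```yaml
# Illustrative fragment; HUDI_WS must point at the Hudi checkout on the host.
services:
  adhoc-1:
    volumes:
      - ${HUDI_WS}:/var/hoodie/ws   # host workspace visible inside the container
```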

Health Checks and Stabilization:

The compose file includes health check definitions for critical services (NameNode, DataNode, HistoryServer, HiveMetastore). These health checks use HTTP or TCP probes to verify service readiness. The 15-second delay in the startup script provides additional stabilization time for services that may pass health checks before being fully operational. The subsequent in-container setup (Phase 4) depends on HDFS being accessible, which requires both the NameNode and DataNode to be healthy.
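A sketch of what an HTTP-probe healthcheck for the NameNode Web UI (port 9870, per the table above) could look like; the probe command, interval, timeout, and retry values are illustrative assumptions, not the demo's actual settings.

```yaml
# Illustrative fragment -- values are assumptions, not the demo's configuration.
services:
  namenode:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9870/"]
      interval: 30s
      timeout: 10s
      retries: 5
```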

Related Pages

Implemented By
