Principle: Apache Hudi Demo Environment Startup
| Knowledge Sources | |
|---|---|
| Domains | DevOps, Development_Environment |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Orchestrating the startup of a complete multi-service Hadoop/Spark/Hive cluster within Docker containers to provide a fully functional Apache Hudi demonstration environment.
Description
The Demo Environment Startup principle encompasses the end-to-end process of bringing up the 13-service Docker cluster that forms the Hudi demo environment. This is not a simple single-container launch; it involves coordinated startup of interdependent services spanning HDFS storage, Hive metadata management, Spark compute, message queuing (Kafka/Zookeeper), and object storage (MinIO).
The startup process follows a specific sequence:
Phase 1: Teardown of previous state. Any previously running demo containers are stopped and removed via docker compose down. This ensures a clean starting state and prevents port conflicts or stale container state from interfering.
Phase 2: Image acquisition. In the default (non-dev) mode, images are pulled from Docker Hub. In dev mode, this step is skipped and locally-built images are used instead. This dual-mode design supports both end-users who want a quick demo and developers who are testing code changes.
Phase 3: Container orchestration. Docker Compose reads the architecture-appropriate YAML configuration file and starts all 13 services in detached mode. The compose file defines inter-service dependencies, port mappings, volume mounts, environment variables, and health checks that govern startup ordering.
Phase 4: In-container setup. After the containers are running (with a 15-second stabilization delay), the setup_demo_container.sh script is executed inside the adhoc-1 and adhoc-2 containers. This script copies Spark configuration files, creates HDFS directories (/var/demo/ and /tmp/spark-events), uploads demo configuration to HDFS, and sets executable permissions on the Hive sync tool.
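The four phases above can be sketched as a shell sequence. This is a minimal illustration, not the actual startup script: the `DRY_RUN` wrapper, the `DEV_MODE` variable name, and the in-container script path are assumptions based on the description above.

```shell
# Print each command instead of running it when DRY_RUN=1, so the
# sequence can be inspected without a Docker daemon.
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

start_demo() {
  compose="docker compose -f ${COMPOSE_FILE_NAME}"
  run $compose down                   # Phase 1: teardown of previous state
  if [ "${DEV_MODE:-0}" != "1" ]; then
    run $compose pull                 # Phase 2: image acquisition (skipped in dev mode)
  fi
  run $compose up -d                  # Phase 3: container orchestration
  run sleep 15                        # stabilization delay
  for c in adhoc-1 adhoc-2; do        # Phase 4: in-container setup
    run docker exec "$c" /bin/bash /var/hoodie/ws/docker/setup_demo_container.sh
  done
}
```

Running `DRY_RUN=1 COMPOSE_FILE_NAME=... start_demo` prints the command sequence for review before committing to a real startup.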
The 13 Services:
| Service | Image | Primary Role | Key Port(s) |
|---|---|---|---|
| namenode | hudi-hadoop_3.3.4-namenode | HDFS metadata management | 9870 (Web UI), 8020 (IPC) |
| datanode1 | hudi-hadoop_3.3.4-datanode | HDFS block storage | 9864 (Web UI) |
| historyserver | hudi-hadoop_3.3.4-history | MapReduce job history | 8188, 19888 |
| hive-metastore-postgresql | bde2020/hive-metastore-postgresql | Hive metastore backing DB | 5432 |
| hivemetastore | hudi-hadoop_3.3.4-hive_3.1.3 | Hive Metastore Service | 9083 |
| hiveserver | hudi-hadoop_3.3.4-hive_3.1.3 | HiveServer2 (SQL interface) | 10000, 10002 |
| zookeeper | bitnami/zookeeper | Distributed coordination | 2181 |
| kafkabroker | apache/kafka | Message streaming | 9092 |
| sparkmaster | hudi-hadoop_3.3.4-hive_3.1.3-sparkmaster_3.5.3 | Spark cluster manager | 8080, 7077, 8888 |
| spark-worker-1 | hudi-hadoop_3.3.4-hive_3.1.3-sparkworker_3.5.3 | Spark task execution | 8081 |
| adhoc-1 | hudi-hadoop_3.3.4-hive_3.1.3-sparkadhoc_3.5.3 | Ad-hoc Spark queries | 4040 |
| adhoc-2 | hudi-hadoop_3.3.4-hive_3.1.3-sparkadhoc_3.5.3 | Ad-hoc Spark queries | 5005 |
| minio + mc | minio/minio, minio/mc | S3-compatible object storage | 9090, 9091 |
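After startup, the web UIs in the table above offer a quick way to confirm the cluster is reachable. A hedged sketch, assuming the default port mappings are published on localhost:

```shell
# Probe a service's web UI and report whether it responds.
check_ui() {
  if curl -sf -o /dev/null "http://localhost:$1"; then
    echo "port $1: up"
  else
    echo "port $1: down"
  fi
}

# NameNode UI, DataNode UI, Spark master UI, MinIO (ports from the table above).
for p in 9870 9864 8080 9090; do check_ui "$p"; done
```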
Usage
Apply this principle:
- When starting the Hudi demo for the first time on a new machine
- When restarting the demo after making configuration changes
- When switching between dev mode (local images) and default mode (Docker Hub images)
- When debugging service startup failures by understanding the expected startup sequence
Theoretical Basis
Docker Compose Orchestration:
Docker Compose manages multi-container applications through a declarative YAML specification. The Hudi demo compose file defines service dependencies using depends_on and links directives, which control startup ordering. For example, datanode1 depends on namenode, ensuring the HDFS NameNode is running before DataNodes attempt to register. Similarly, hiveserver depends on hivemetastore, which in turn depends on hive-metastore-postgresql.
Architecture-Aware Configuration:
The startup script selects between two compose files based on the host architecture:
```shell
COMPOSE_FILE_NAME="docker-compose_hadoop334_hive313_spark353_amd64.yml"
if [ "$(uname -m)" = "arm64" ]; then
  COMPOSE_FILE_NAME="docker-compose_hadoop334_hive313_spark353_arm64.yml"
fi
```
This ensures that the correct platform-specific images are used without requiring manual configuration. The amd64 and arm64 compose files differ primarily in their platform: directives and potentially in image tags.
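Note that `uname -m` reports `arm64` on macOS but `aarch64` on Linux ARM hosts, so the check above would fall through to the amd64 file on a Linux ARM machine. A hedged sketch that covers both spellings (the function name is illustrative, not part of the actual script):

```shell
# Map the machine architecture string to the matching compose file,
# accepting both the macOS ("arm64") and Linux ("aarch64") spellings.
select_compose_file() {
  case "$1" in
    arm64|aarch64) echo "docker-compose_hadoop334_hive313_spark353_arm64.yml" ;;
    *)             echo "docker-compose_hadoop334_hive313_spark353_amd64.yml" ;;
  esac
}

COMPOSE_FILE_NAME="$(select_compose_file "$(uname -m)")"
```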
Volume Mounting for Development:
The compose configuration mounts the Hudi workspace (${HUDI_WS}) into several containers at /var/hoodie/ws. This bind mount allows containers to access the latest Hudi source code and built artifacts directly from the host filesystem, enabling a rapid edit-test cycle without rebuilding images.
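A minimal sketch of that bind mount as it might appear under one of the ad-hoc services; the actual compose file likely repeats it across several services:

```yaml
services:
  adhoc-1:
    volumes:
      - ${HUDI_WS}:/var/hoodie/ws
```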
Health Checks and Stabilization:
The compose file includes health check definitions for critical services (NameNode, DataNode, HistoryServer, HiveMetastore). These health checks use HTTP or TCP probes to verify service readiness. The 15-second delay in the startup script provides additional stabilization time for services that may pass health checks before being fully operational. The subsequent in-container setup (Phase 4) depends on HDFS being accessible, which requires both the NameNode and DataNode to be healthy.
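Because a fixed sleep can still race a slow service, a polling loop is a common alternative. A minimal sketch; the `hdfs dfs -ls /` probe inside adhoc-1 is an assumed readiness check, not taken from the actual script:

```shell
# Retry a command until it succeeds or the attempt budget is exhausted.
wait_until_ready() {
  local attempts=$1; shift
  local i=0
  until "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$attempts" ] && return 1
    sleep 1
  done
  return 0
}

# Usage (assumed probe): block Phase 4 until HDFS answers.
# wait_until_ready 30 docker exec adhoc-1 hdfs dfs -ls /
```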