Principle: Apache Hudi Demo Environment Startup
| Knowledge Sources | |
|---|---|
| Domains | DevOps, Development_Environment |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Orchestrating the startup of a complete multi-service Hadoop/Spark/Hive cluster within Docker containers to provide a fully functional Apache Hudi demonstration environment.
Description
The Demo Environment Startup principle encompasses the end-to-end process of bringing up the 13-service Docker cluster that forms the Hudi demo environment. This is not a simple single-container launch; it involves coordinated startup of interdependent services spanning HDFS storage, Hive metadata management, Spark compute, message queuing (Kafka/Zookeeper), and object storage (MinIO).
The startup process follows a specific sequence:
Phase 1: Teardown of previous state. Any previously running demo containers are stopped and removed via docker compose down. This ensures a clean starting state and prevents port conflicts or stale container state from interfering.
Phase 2: Image acquisition. In the default (non-dev) mode, images are pulled from Docker Hub. In dev mode, this step is skipped and locally-built images are used instead. This dual-mode design supports both end-users who want a quick demo and developers who are testing code changes.
Phase 3: Container orchestration. Docker Compose reads the architecture-appropriate YAML configuration file and starts all 13 services in detached mode. The compose file defines inter-service dependencies, port mappings, volume mounts, environment variables, and health checks that govern startup ordering.
Phase 4: In-container setup. After the containers are running (with a 15-second stabilization delay), the setup_demo_container.sh script is executed inside the adhoc-1 and adhoc-2 containers. This script copies Spark configuration files, creates HDFS directories (/var/demo/ and /tmp/spark-events), uploads demo configuration to HDFS, and sets executable permissions on the Hive sync tool.
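The four phases above can be sketched as a shell sequence. This is a minimal illustration, not the actual startup script: the `DRY_RUN` wrapper, the `DEV_MODE` variable name, and the in-container script path are assumptions based on the description above.

```shell
# Print each command instead of running it when DRY_RUN=1, so the
# sequence can be inspected without a Docker daemon.
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

start_demo() {
  compose="docker compose -f ${COMPOSE_FILE_NAME}"
  run $compose down                   # Phase 1: teardown of previous state
  if [ "${DEV_MODE:-0}" != "1" ]; then
    run $compose pull                 # Phase 2: image acquisition (skipped in dev mode)
  fi
  run $compose up -d                  # Phase 3: container orchestration
  run sleep 15                        # stabilization delay
  for c in adhoc-1 adhoc-2; do        # Phase 4: in-container setup
    run docker exec "$c" /bin/bash /var/hoodie/ws/docker/setup_demo_container.sh
  done
}
```

Running `DRY_RUN=1 COMPOSE_FILE_NAME=... start_demo` prints the command sequence for review before committing to a real startup.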
The 13 Services:
| Service | Image | Primary Role | Key Port(s) |
|---|---|---|---|
| namenode | hudi-hadoop_3.3.4-namenode | HDFS metadata management | 9870 (Web UI), 8020 (IPC) |
| datanode1 | hudi-hadoop_3.3.4-datanode | HDFS block storage | 9864 (Web UI) |
| historyserver | hudi-hadoop_3.3.4-history | MapReduce job history | 8188, 19888 |
| hive-metastore-postgresql | bde2020/hive-metastore-postgresql | Hive metastore backing DB | 5432 |
| hivemetastore | hudi-hadoop_3.3.4-hive_3.1.3 | Hive Metastore Service | 9083 |
| hiveserver | hudi-hadoop_3.3.4-hive_3.1.3 | HiveServer2 (SQL interface) | 10000, 10002 |
| zookeeper | bitnami/zookeeper | Distributed coordination | 2181 |
| kafkabroker | apache/kafka | Message streaming | 9092 |
| sparkmaster | hudi-hadoop_3.3.4-hive_3.1.3-sparkmaster_3.5.3 | Spark cluster manager | 8080, 7077, 8888 |
| spark-worker-1 | hudi-hadoop_3.3.4-hive_3.1.3-sparkworker_3.5.3 | Spark task execution | 8081 |
| adhoc-1 | hudi-hadoop_3.3.4-hive_3.1.3-sparkadhoc_3.5.3 | Ad-hoc Spark queries | 4040 |
| adhoc-2 | hudi-hadoop_3.3.4-hive_3.1.3-sparkadhoc_3.5.3 | Ad-hoc Spark queries | 5005 |
| minio + mc | minio/minio, minio/mc | S3-compatible object storage | 9090, 9091 |
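After startup, the web UIs in the table above offer a quick way to confirm the cluster is reachable. A hedged sketch, assuming the default port mappings are published on localhost:

```shell
# Probe a service's web UI and report whether it responds.
check_ui() {
  if curl -sf -o /dev/null "http://localhost:$1"; then
    echo "port $1: up"
  else
    echo "port $1: down"
  fi
}

# NameNode UI, DataNode UI, Spark master UI, MinIO (ports from the table above).
for p in 9870 9864 8080 9090; do check_ui "$p"; done
```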
Usage
Apply this principle:
- When starting the Hudi demo for the first time on a new machine
- When restarting the demo after making configuration changes
- When switching between dev mode (local images) and default mode (Docker Hub images)
- When debugging service startup failures by understanding the expected startup sequence
Theoretical Basis
Docker Compose Orchestration:
Docker Compose manages multi-container applications through a declarative YAML specification. The Hudi demo compose file defines service dependencies using depends_on and links directives, which control startup ordering. For example, datanode1 depends on namenode, ensuring the HDFS NameNode is running before DataNodes attempt to register. Similarly, hiveserver depends on hivemetastore, which in turn depends on hive-metastore-postgresql.
Architecture-Aware Configuration:
The startup script selects between two compose files based on the host architecture:
```shell
COMPOSE_FILE_NAME="docker-compose_hadoop334_hive313_spark353_amd64.yml"
if [ "$(uname -m)" = "arm64" ]; then
  COMPOSE_FILE_NAME="docker-compose_hadoop334_hive313_spark353_arm64.yml"
fi
```
This ensures that the correct platform-specific images are used without requiring manual configuration. The amd64 and arm64 compose files differ primarily in their platform: directives and potentially in image tags.
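Note that `uname -m` reports `arm64` on macOS but `aarch64` on Linux ARM hosts, so the check above would fall through to the amd64 file on a Linux ARM machine. A hedged sketch that covers both spellings (the function name is illustrative, not part of the actual script):

```shell
# Map the machine architecture string to the matching compose file,
# accepting both the macOS ("arm64") and Linux ("aarch64") spellings.
select_compose_file() {
  case "$1" in
    arm64|aarch64) echo "docker-compose_hadoop334_hive313_spark353_arm64.yml" ;;
    *)             echo "docker-compose_hadoop334_hive313_spark353_amd64.yml" ;;
  esac
}

COMPOSE_FILE_NAME="$(select_compose_file "$(uname -m)")"
```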
Volume Mounting for Development:
The compose configuration mounts the Hudi workspace (${HUDI_WS}) into several containers at /var/hoodie/ws. This bind mount allows containers to access the latest Hudi source code and built artifacts directly from the host filesystem, enabling a rapid edit-test cycle without rebuilding images.
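A minimal sketch of that bind mount as it might appear under one of the ad-hoc services; the actual compose file likely repeats it across several services:

```yaml
services:
  adhoc-1:
    volumes:
      - ${HUDI_WS}:/var/hoodie/ws
```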
Health Checks and Stabilization:
The compose file includes health check definitions for critical services (NameNode, DataNode, HistoryServer, HiveMetastore). These health checks use HTTP or TCP probes to verify service readiness. The 15-second delay in the startup script provides additional stabilization time for services that may pass health checks before being fully operational. The subsequent in-container setup (Phase 4) depends on HDFS being accessible, which requires both the NameNode and DataNode to be healthy.
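Because a fixed sleep can still race a slow service, a polling loop is a common alternative. A minimal sketch; the `hdfs dfs -ls /` probe inside adhoc-1 is an assumed readiness check, not taken from the actual script:

```shell
# Retry a command until it succeeds or the attempt budget is exhausted.
wait_until_ready() {
  local attempts=$1; shift
  local i=0
  until "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$attempts" ] && return 1
    sleep 1
  done
  return 0
}

# Usage (assumed probe): block Phase 4 until HDFS answers.
# wait_until_ready 30 docker exec adhoc-1 hdfs dfs -ls /
```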