
Principle:Apache Hudi Docker Image Building

From Leeroopedia


Knowledge Sources
Domains DevOps, Development_Environment
Last Updated 2026-02-08 00:00 GMT

Overview

Constructing the set of Docker images that form the Apache Hudi demo cluster, encompassing Hadoop, Hive, Spark, and supporting infrastructure components.

Description

The Docker Image Building principle covers the process of creating the nine container images that constitute the Hudi demo environment. These images are layered in a dependency chain: a base image provides the JDK and Hadoop runtime, and subsequent images build on it to add the HDFS NameNode and DataNode, the MapReduce History Server, Hive, and the Spark base, Master, Worker, and ad-hoc services.

Hudi provides two distinct approaches to building these images:

1. Direct Docker Build (build_docker_images.sh):

This approach invokes the Docker CLI directly to build images from the Dockerfiles under the docker/hoodie/hadoop/ subdirectories. It is the faster method because it bypasses Maven entirely and talks to the Docker daemon directly. The script auto-detects the host architecture (amd64 or arm64) with uname -m and sets the DOCKER_DEFAULT_PLATFORM environment variable accordingly. Each image is tagged with both a version tag (1.1.0) and latest.
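As a minimal sketch, the direct-build loop might look like the following. The image names come from the inventory below, but the directory layout and loop structure are assumptions; the block is written as a dry run that echoes the docker build commands rather than invoking the Docker daemon.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the direct Docker build loop (assumed layout;
# echoes the commands instead of calling the Docker daemon).
VERSION=1.1.0
IMAGES="base namenode datanode history"   # subset, for illustration

for img in $IMAGES; do
  tag="apachehudi/hudi-hadoop_3.3.4-${img}"
  # Each image receives both a version tag and latest.
  echo docker build "docker/hoodie/hadoop/${img}" \
    -t "${tag}:${VERSION}" -t "${tag}:latest"
done
```

Dropping the echo would run the real builds; the hypothetical docker/hoodie/hadoop/${img} build-context paths would need to match the actual repository layout.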

2. Maven-Based Build (build_local_docker_images.sh):

This approach uses Maven's pre-integration-test lifecycle phase to build Docker images as part of the project's build pipeline. It first compiles all Hudi modules, copies the resulting JARs into the Docker build context directories, and then triggers Docker image assembly. This ensures the images contain the latest locally-built Hudi artifacts, which is essential for development and testing of Hudi code changes.
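Under stated assumptions, the Maven-based flow reduces to two steps; the exact flags are an assumption here, and build_local_docker_images.sh remains the authoritative source:

```shell
# Sketch of the Maven-based flow (flags assumed; see
# docker/build_local_docker_images.sh for the real invocation).
mvn clean package -DskipTests          # compile all Hudi modules
mvn pre-integration-test -DskipTests   # copy JARs, assemble Docker images
```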

Image Inventory:

The build process creates nine images, each serving a specific role in the Hadoop/Spark ecosystem:

  • hudi-hadoop_3.3.4-base: Base image with JDK and Hadoop libraries
  • hudi-hadoop_3.3.4-namenode: HDFS NameNode (metadata management)
  • hudi-hadoop_3.3.4-datanode: HDFS DataNode (block storage)
  • hudi-hadoop_3.3.4-history: MapReduce History Server
  • hudi-hadoop_3.3.4-hive_3.1.3: Hive Metastore and HiveServer2
  • hudi-hadoop_3.3.4-hive_3.1.3-sparkbase_3.5.3: Spark base with Hive integration
  • hudi-hadoop_3.3.4-hive_3.1.3-sparkmaster_3.5.3: Spark Master node
  • hudi-hadoop_3.3.4-hive_3.1.3-sparkworker_3.5.3: Spark Worker node
  • hudi-hadoop_3.3.4-hive_3.1.3-sparkadhoc_3.5.3: Spark ad-hoc query node

Usage

Apply this principle:

  • When contributing code changes to Hudi and needing to test within the Docker demo environment
  • When customizing the demo environment (e.g., adding additional libraries or configuration)
  • When the pre-built images on Docker Hub do not match the branch or version being developed
  • When building images for non-amd64 architectures (e.g., Apple Silicon / arm64)

Theoretical Basis

Docker Image Layering:

Docker images are built from a series of read-only layers, each representing a filesystem change introduced by a Dockerfile instruction. The Hudi images exploit this layering for efficiency: the base image installs the JDK and Hadoop, and all downstream images (FROM apachehudi/hudi-hadoop_3.3.4-base) inherit those layers without duplication. This reduces total disk usage and speeds up builds when only upper layers change.
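The inheritance chain can be sketched as a Dockerfile FROM line. The instructions below are illustrative only; the real Dockerfiles live under docker/hoodie/hadoop/ and their contents differ:

```dockerfile
# Illustrative only: a downstream image reuses every layer of the base.
FROM apachehudi/hudi-hadoop_3.3.4-base:latest

# Layers added here (e.g., a Hive installation) stack on top of the
# base image's JDK + Hadoop layers without duplicating them on disk.
COPY conf/ /opt/hive/conf/
```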

Multi-Architecture Support:

Modern container registries support multi-architecture manifests, allowing a single image tag to resolve to different platform-specific images. The Hudi build scripts detect the host architecture at build time:

ARCHITECTURE=$(uname -m)
case "$ARCHITECTURE" in
  x86_64|amd64) DOCKER_PLATFORM='linux/amd64' ;;
  aarch64|arm64) DOCKER_PLATFORM='linux/arm64' ;;
  *) echo "Unsupported architecture: $ARCHITECTURE"; exit 1 ;;
esac
export DOCKER_DEFAULT_PLATFORM="$DOCKER_PLATFORM"

This ensures images are built natively for the host platform, avoiding the performance overhead of emulation through QEMU.

Dual Tagging Strategy:

Each image receives both a latest tag and a versioned tag (e.g., 1.1.0). The latest tag provides a stable reference for compose files and scripts, while the versioned tag provides immutable references for reproducible deployments and CI/CD pipelines. This dual-tagging pattern is a Docker best practice that balances convenience with reproducibility.
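As a minimal sketch, both tags can be applied in a single build; this is shown as a dry run that echoes the command rather than invoking the Docker daemon, with the image name taken from the inventory above:

```shell
# Dry-run sketch of dual tagging: one build, two tags.
IMAGE="apachehudi/hudi-hadoop_3.3.4-base"
VERSION="1.1.0"

# A single `docker build` can apply both tags at once via repeated -t.
echo docker build . -t "${IMAGE}:${VERSION}" -t "${IMAGE}:latest"
```

Compose files can then pin :latest for convenience while CI pipelines reference the immutable :1.1.0 tag.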

Build vs. Pull Trade-off:

Building images locally takes several minutes but guarantees the images contain the exact code being developed. Pulling pre-built images from Docker Hub is faster but only provides released versions. The setup_demo.sh script's dev flag bridges this gap: when set, it skips the pull step and uses locally-built images instead.
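The invocation below is an assumption based on the dev flag described above; check the script itself for the supported arguments:

```shell
# Assumed usage: with the dev flag, setup_demo.sh skips `docker pull`
# and starts the cluster from locally-built images.
./docker/setup_demo.sh dev
```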

Related Pages

Implemented By
