
Workflow:Apache Hudi Docker Demo Setup

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, DevOps, Data_Lake
Last Updated: 2026-02-08 20:00 GMT

Overview

End-to-end process for setting up and running the Apache Hudi Docker demo environment, providing a self-contained data lake sandbox with Hadoop, Hive, Spark, Trino, and Hudi pre-configured.

Description

This workflow covers deploying the Hudi Docker demo environment, which provides a fully functional data lakehouse stack for local experimentation. The demo includes HDFS for storage, Hive Metastore for catalog management, Spark for writing and querying Hudi tables, and Trino for interactive SQL queries. The environment is orchestrated via Docker Compose and comes with sample datasets and pre-configured connectors to demonstrate Hudi's core capabilities, including upserts, incremental queries, and time-travel.

Usage

Execute this workflow when you want to explore Hudi's features locally without setting up a full distributed environment. The demo is ideal for learning Hudi concepts, testing write operations, experimenting with different table types and query modes, or validating integrations before deploying to a production cluster.

Execution Steps

Step 1: Prerequisites Check

Verify that Docker and Docker Compose are installed and have sufficient resources allocated. The demo requires multiple containers running simultaneously, so ensure adequate memory (recommended 8GB+) and disk space. Verify that the required ports are available and not occupied by other services.

Key considerations:

  • Docker must be installed with at least 8GB memory allocation
  • Docker Compose v2+ is required for the multi-container setup
  • Required ports include those for HDFS NameNode, Hive, Spark, Trino, and Presto
  • Network access may be needed to pull Docker images from Docker Hub
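The checks above can be sketched as a small pre-flight script. The `check_tool` helper is ours, and the port selection is an assumption (NameNode UI, HiveServer2, Hive Metastore, Spark master); consult the demo's Docker Compose file for the authoritative list.

```shell
# Pre-flight sketch: verify tools are installed and list ports to confirm free.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "MISSING: $1"
  fi
}

check_tool docker
check_tool docker-compose   # Compose v2 also ships as the "docker compose" subcommand

# Ports worth confirming are free before startup (hypothetical selection):
for port in 50070 10000 9083 8080; do
  echo "check that nothing is listening on localhost:$port"
done
```

Memory allocation cannot be checked portably from the shell; on Docker Desktop, confirm the 8GB+ setting in the resource preferences.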

Step 2: Build Docker Images

Obtain the Hudi Docker demo images either by pulling pre-built images from Docker Hub or by building them from source. The build_local_docker_images.sh script builds all required images, including the Hadoop, Hive, Spark, and Trino base images with the Hudi bundle JARs pre-installed. Alternatively, use build_docker_images.sh to pull the latest pre-built images.

Key considerations:

  • Building from source requires the Hudi project to be compiled first with the integration-tests profile
  • Pre-built images are faster but may not include the latest changes
  • Each image layer includes specific service configuration and Hudi bundle JARs
  • The Maven POM defines image names and repository coordinates
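A minimal sketch of the two build paths, assuming a Hudi source checkout at `$HUDI_HOME` (a hypothetical variable) and the script names described above:

```shell
# Build-or-pull sketch; the checkout location is an assumption.
HUDI_HOME="${HUDI_HOME:-$HOME/hudi}"

if [ -d "$HUDI_HOME/docker" ]; then
  cd "$HUDI_HOME"
  # Option A: compile Hudi first, then build the demo images from source
  mvn clean package -DskipTests
  (cd docker && ./build_local_docker_images.sh)
  # Option B: pull the latest pre-built images instead
  # (cd docker && ./build_docker_images.sh)
else
  echo "Hudi checkout not found at $HUDI_HOME" >&2
fi
```

Building from source takes considerably longer but guarantees the images match your local branch; pulling is the quicker path when upstream images suffice.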

Step 3: Start Demo Environment

Launch the demo environment using the setup_demo.sh script, which invokes Docker Compose with the appropriate configuration. The script starts all required services in dependency order: HDFS first, then Hive Metastore, followed by Spark and query engines. Wait for all services to become healthy before proceeding.

Key considerations:

  • Use setup_demo.sh dev for locally-built images or setup_demo.sh for pre-built
  • Services start in dependency order and may take several minutes to initialize
  • The Docker Compose file defines networking, volume mounts, and environment variables
  • Health checks ensure services are ready before dependent services start
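Startup can be sketched as follows; the entry points match the scripts named above, while the status-polling command is our own addition:

```shell
# Startup sketch; assumes the demo scripts live under $HUDI_HOME/docker.
DEMO_DIR="${HUDI_HOME:-$HOME/hudi}/docker"

if command -v docker >/dev/null 2>&1 && [ -d "$DEMO_DIR" ]; then
  cd "$DEMO_DIR"
  ./setup_demo.sh            # or: ./setup_demo.sh dev  (locally-built images)
  # Poll until every container reports Up/healthy before running queries
  docker ps --format '{{.Names}}: {{.Status}}'
else
  echo "docker or demo directory unavailable; skipping" >&2
fi
```

Re-running the `docker ps` line until no container shows "starting" or "unhealthy" is a simple substitute for a dedicated wait loop.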

Step 4: Explore Hudi Features

With the demo environment running, use the provided sample scripts and data to explore Hudi's features. Create Hudi tables with different configurations, run upsert and delete operations, execute snapshot and incremental queries, and observe the Hudi timeline and file layout on HDFS.

Key considerations:

  • Sample datasets are pre-loaded in the demo/data directory
  • Spark shell and Spark SQL can be used for write operations
  • Trino and Hive provide SQL query interfaces
  • The HDFS NameNode UI shows the underlying file structure
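A first exploration session might look like the sketch below. The container names (`adhoc-2`, `trino-coordinator-1`) and the example query are assumptions modeled on typical demo layouts, not authoritative names; list the running containers with `docker ps` to find the ones in your environment.

```shell
# Exploration sketch; container names and query are hypothetical.
EXAMPLE_QUERY='SELECT * FROM hudi_demo_table LIMIT 10'

if command -v docker >/dev/null 2>&1; then
  # Inspect the Hudi file layout (base and log files) directly on HDFS
  docker exec adhoc-2 /bin/bash -c 'hdfs dfs -ls -R /user/hive/warehouse'
  # Interactive SQL over the same tables via Trino
  docker exec trino-coordinator-1 trino --execute "$EXAMPLE_QUERY"
else
  echo "docker unavailable; skipping" >&2
fi
```

Comparing the HDFS listing before and after an upsert is an effective way to see how Hudi organizes file groups and commits on the timeline.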

Step 5: Stop and Cleanup

When done, gracefully stop the demo environment using stop_demo.sh. This stops all containers and optionally removes volumes and networks. To reset the environment for a fresh start, remove all persistent volumes.

Key considerations:

  • Always use the stop script for graceful shutdown
  • Persistent volumes retain data between restarts unless explicitly removed
  • Container logs can be inspected for debugging before stopping
  • The generate_test_suite.sh script can run automated integration tests before teardown
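Teardown can be sketched as below. Note that `docker volume prune` removes all dangling volumes on the host, not only the demo's, so run it deliberately:

```shell
# Teardown sketch; assumes the demo scripts live under $HUDI_HOME/docker.
DEMO_DIR="${HUDI_HOME:-$HOME/hudi}/docker"

if command -v docker >/dev/null 2>&1 && [ -d "$DEMO_DIR" ]; then
  (cd "$DEMO_DIR" && ./stop_demo.sh)
  # Optional full reset: drop persisted HDFS/metastore state for a clean restart
  docker volume prune -f
else
  echo "docker or demo directory unavailable; skipping" >&2
fi
```

Skipping the prune step keeps table data and catalog entries intact across restarts, which is convenient for multi-session experimentation.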

Execution Diagram

GitHub URL

Workflow Repository