Principle: Apache Hudi Feature Exploration
| Knowledge Sources | |
|---|---|
| Domains | DevOps, Development_Environment |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Provides an interactive environment for hands-on exploration of Apache Hudi features through Jupyter notebooks, Spark shells, and query engines within a containerized data lakehouse stack.
Description
The Hudi Feature Exploration principle addresses the need for developers, data engineers, and evaluators to experience Hudi's capabilities in a self-contained, reproducible environment without requiring access to a production cluster. The exploration environment bundles together multiple interactive tools that allow users to perform CRUD operations on Hudi tables, explore query types, test schema evolution, and experiment with SQL procedures.
The exploration environment consists of several components:
JupyterLab Notebooks: Five pre-built notebooks provide structured, step-by-step tutorials covering core Hudi features:
| Notebook | Topic |
|---|---|
| 01-crud-operations.ipynb | Insert, update, delete, and upsert operations on Hudi tables |
| 02-query-types.ipynb | Snapshot, incremental, and read-optimized query modes |
| 03-scd-type2_and_type4.ipynb | Slowly Changing Dimension patterns (Type 2 and Type 4) |
| 04-schema-evolution.ipynb | Adding, renaming, and dropping columns with backward compatibility |
| 05-mastering-sql-procedures.ipynb | Hudi's SQL call procedures for table management |
Spark Shell: Direct access to the Spark REPL with Hudi dependencies pre-configured, allowing ad-hoc data manipulation and querying.
Trino Query Engine: A distributed SQL engine connected to the Hive Metastore, enabling standard SQL queries against Hudi tables. Trino provides an alternative query path for users familiar with the Trino/Presto ecosystem.
MinIO Object Storage: An S3-compatible object store that serves as the underlying storage layer, mimicking cloud-based data lakehouse architectures that use S3 as persistent storage.
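Concretely, Spark reaches MinIO through Hadoop's S3A connector. The settings below are a hedged sketch of what such a configuration might look like; the endpoint, credentials, and exact property placement (spark-defaults.conf vs. code) are assumptions, not taken from this environment:

```properties
# spark-defaults.conf (illustrative values only)
spark.hadoop.fs.s3a.endpoint           http://minio:9000
spark.hadoop.fs.s3a.path.style.access  true
spark.hadoop.fs.s3a.access.key         admin
spark.hadoop.fs.s3a.secret.key         password
spark.hadoop.fs.s3a.impl               org.apache.hadoop.fs.s3a.S3AFileSystem
```

Path-style access is the key deviation from stock AWS settings: MinIO serves buckets under the endpoint path rather than as virtual-hosted subdomains.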
The environment is managed through the run_spark_hudi.sh script, which provides start, stop, and restart subcommands. This script uses Docker Compose to orchestrate the notebook container alongside its dependencies (Hive Metastore, MinIO, Trino).
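The script itself is not reproduced here, but a subcommand dispatcher of roughly this shape is one plausible sketch. To keep it readable without Docker installed, `compose_cmd` echoes the Docker Compose invocation instead of executing it; the compose file name and function names are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a start/stop/restart dispatcher like run_spark_hudi.sh.
# compose_cmd prints the docker compose invocation rather than running it.
compose_cmd() { echo "docker compose -f docker-compose.yml $*"; }

run_spark_hudi() {
  case "$1" in
    start)   compose_cmd up -d ;;                     # bring the stack up detached
    stop)    compose_cmd down ;;                      # tear the stack down
    restart) compose_cmd down; compose_cmd up -d ;;   # stop, then start again
    *)       echo "usage: $0 {start|stop|restart}" >&2; return 1 ;;
  esac
}
```

In a real script, `compose_cmd` would invoke `docker compose` directly; Compose's `depends_on` ordering then ensures the metastore and MinIO are up before the notebook container starts.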
Usage
Apply this principle:
- When evaluating Hudi for a new project and wanting to understand its feature set
- When learning Hudi concepts through hands-on experimentation
- When developing new Hudi notebooks or tutorials
- When demonstrating Hudi capabilities to stakeholders
- When testing Hudi behavior with different table types, query modes, or schema changes
Theoretical Basis
Interactive Computing with Jupyter:
Jupyter notebooks implement the literate programming paradigm, interleaving executable code cells with rich-text documentation. For data engineering tools like Hudi, this format is particularly effective because it allows users to see the code that creates or queries a table alongside the resulting output, making cause-and-effect relationships immediately visible. Each notebook cell executes independently but shares state (SparkSession, variables, tables) with other cells in the same kernel session.
Data Lakehouse Architecture:
The exploration environment instantiates a minimal data lakehouse: a pattern that combines the scalability of data lakes with the transactional guarantees of data warehouses. In this architecture:
- MinIO acts as the object storage layer (analogous to S3 in production)
- Hudi provides ACID transactions, time-travel queries, and incremental processing on top of the object store
- Hive Metastore maintains a catalog of tables and their schemas
- Trino enables federated SQL queries across the catalog
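The Trino side of this wiring is typically a Hive-connector catalog pointed at the metastore and at MinIO. The fragment below is a sketch under assumptions (file path, port, and credentials are illustrative, not taken from this environment):

```properties
# etc/catalog/hive.properties (hypothetical path and values)
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
# Hive-connector S3 settings aimed at MinIO instead of AWS
hive.s3.endpoint=http://minio:9000
hive.s3.path-style-access=true
hive.s3.aws-access-key=admin
hive.s3.aws-secret-key=password
```

With such a catalog in place, tables registered in the metastore become queryable as `hive.<schema>.<table>` from any Trino client.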
Service Orchestration:
The docker-compose.yml for the notebook environment defines five services with explicit dependency ordering:
```
spark-hudi (depends_on: hive-metastore, minio)
 |
 +-- hive-metastore (standalone)
 |
 +-- minio (standalone)
 |    |
 |    +-- mc (depends_on: minio, initializes buckets)
 |
trino (depends_on: hive-metastore, minio)
```
The mc (MinIO Client) container runs an initialization entrypoint that creates the warehouse bucket with a public access policy, ensuring that Spark and Trino can read and write data without additional credential configuration.
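A bucket-initializing `mc` service is commonly written as a short entrypoint like the one below. This is a hedged sketch, not the environment's actual compose file: the alias, bucket name, and credentials are assumptions.

```yaml
# Hypothetical mc init service; image tag, alias, and bucket name are assumptions.
mc:
  image: minio/mc
  depends_on:
    - minio
  entrypoint: >
    /bin/sh -c "
    mc alias set local http://minio:9000 admin password &&
    mc mb --ignore-existing local/warehouse &&
    mc anonymous set public local/warehouse
    "
```

The container runs once, creates the bucket idempotently (`--ignore-existing`), opens anonymous access, and exits; Spark and Trino can then write to `s3a://warehouse/...` without per-service credentials.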
Port Mapping Strategy:
The environment exposes well-known ports for each service, allowing users to access web UIs directly from the host:
- 8888 -- JupyterLab (primary interaction point)
- 4040 -- Spark Application UI (job monitoring)
- 9000/9001 -- MinIO S3 API and Console
- 8085 -- Trino Web UI (mapped from container port 8080 to avoid conflict with Spark Master)
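In Docker Compose terms, the mappings above amount to `host:container` port pairs along these lines (service names are assumptions based on the dependency diagram, not the actual file):

```yaml
# Illustrative port mappings (host:container)
spark-hudi:
  ports:
    - "8888:8888"   # JupyterLab
    - "4040:4040"   # Spark Application UI
minio:
  ports:
    - "9000:9000"   # S3 API
    - "9001:9001"   # Console
trino:
  ports:
    - "8085:8080"   # remapped on the host to avoid the Spark Master's 8080
```

Only the Trino mapping is asymmetric: the container still listens on its default 8080 internally, so inter-container traffic is unaffected by the host-side remap.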