Principle: Apache Hudi Feature Exploration
| Knowledge Sources | |
|---|---|
| Domains | DevOps, Development_Environment |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Provides an interactive environment for hands-on exploration of Apache Hudi features through Jupyter notebooks, Spark shells, and query engines within a containerized data lakehouse stack.
Description
The Hudi Feature Exploration principle addresses the need for developers, data engineers, and evaluators to experience Hudi's capabilities in a self-contained, reproducible environment without requiring access to a production cluster. The exploration environment bundles together multiple interactive tools that allow users to perform CRUD operations on Hudi tables, explore query types, test schema evolution, and experiment with SQL procedures.
The exploration environment consists of several components:
JupyterLab Notebooks: Five pre-built notebooks provide structured, step-by-step tutorials covering core Hudi features:
| Notebook | Topic |
|---|---|
| 01-crud-operations.ipynb | Insert, update, delete, and upsert operations on Hudi tables |
| 02-query-types.ipynb | Snapshot, incremental, and read-optimized query modes |
| 03-scd-type2_and_type4.ipynb | Slowly Changing Dimension patterns (Type 2 and Type 4) |
| 04-schema-evolution.ipynb | Adding, renaming, and dropping columns with backward compatibility |
| 05-mastering-sql-procedures.ipynb | Hudi's SQL call procedures for table management |
Spark Shell: Direct access to the Spark REPL with Hudi dependencies pre-configured, allowing ad-hoc data manipulation and querying.
Trino Query Engine: A distributed SQL engine connected to the Hive Metastore, enabling standard SQL queries against Hudi tables. Trino provides an alternative query path for users familiar with the Trino/Presto ecosystem.
MinIO Object Storage: An S3-compatible object store that serves as the underlying storage layer, mimicking cloud-based data lakehouse architectures that use S3 as persistent storage.
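Concretely, Spark reaches MinIO through Hadoop's S3A connector. The settings below are a hedged sketch of what such a configuration might look like; the endpoint, credentials, and exact property placement (spark-defaults.conf vs. code) are assumptions, not taken from this environment:

```properties
# spark-defaults.conf (illustrative values only)
spark.hadoop.fs.s3a.endpoint           http://minio:9000
spark.hadoop.fs.s3a.path.style.access  true
spark.hadoop.fs.s3a.access.key         admin
spark.hadoop.fs.s3a.secret.key         password
spark.hadoop.fs.s3a.impl               org.apache.hadoop.fs.s3a.S3AFileSystem
```

Path-style access is the key deviation from stock AWS settings: MinIO serves buckets under the endpoint path rather than as virtual-hosted subdomains.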
The environment is managed through the run_spark_hudi.sh script, which provides start, stop, and restart subcommands. This script uses Docker Compose to orchestrate the notebook container alongside its dependencies (Hive Metastore, MinIO, Trino).
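The script itself is not reproduced here, but a subcommand dispatcher of roughly this shape is one plausible sketch. To keep it readable without Docker installed, `compose_cmd` echoes the Docker Compose invocation instead of executing it; the compose file name and function names are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a start/stop/restart dispatcher like run_spark_hudi.sh.
# compose_cmd prints the docker compose invocation rather than running it.
compose_cmd() { echo "docker compose -f docker-compose.yml $*"; }

run_spark_hudi() {
  case "$1" in
    start)   compose_cmd up -d ;;                     # bring the stack up detached
    stop)    compose_cmd down ;;                      # tear the stack down
    restart) compose_cmd down; compose_cmd up -d ;;   # stop, then start again
    *)       echo "usage: $0 {start|stop|restart}" >&2; return 1 ;;
  esac
}
```

In a real script, `compose_cmd` would invoke `docker compose` directly; Compose's `depends_on` ordering then ensures the metastore and MinIO are up before the notebook container starts.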
Usage
Apply this principle:
- When evaluating Hudi for a new project and wanting to understand its feature set
- When learning Hudi concepts through hands-on experimentation
- When developing new Hudi notebooks or tutorials
- When demonstrating Hudi capabilities to stakeholders
- When testing Hudi behavior with different table types, query modes, or schema changes
Theoretical Basis
Interactive Computing with Jupyter:
Jupyter notebooks implement the literate programming paradigm, interleaving executable code cells with rich-text documentation. For data engineering tools like Hudi, this format is particularly effective because it allows users to see the code that creates or queries a table alongside the resulting output, making cause-and-effect relationships immediately visible. Each notebook cell executes independently but shares state (SparkSession, variables, tables) with other cells in the same kernel session.
Data Lakehouse Architecture:
The exploration environment instantiates a minimal data lakehouse: a pattern that combines the scalability of data lakes with the transactional guarantees of data warehouses. In this architecture:
- MinIO acts as the object storage layer (analogous to S3 in production)
- Hudi provides ACID transactions, time-travel queries, and incremental processing on top of the object store
- Hive Metastore maintains a catalog of tables and their schemas
- Trino enables federated SQL queries across the catalog
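The Trino side of this wiring is typically a Hive-connector catalog pointed at the metastore and at MinIO. The fragment below is a sketch under assumptions (file path, port, and credentials are illustrative, not taken from this environment):

```properties
# etc/catalog/hive.properties (hypothetical path and values)
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
# Hive-connector S3 settings aimed at MinIO instead of AWS
hive.s3.endpoint=http://minio:9000
hive.s3.path-style-access=true
hive.s3.aws-access-key=admin
hive.s3.aws-secret-key=password
```

With such a catalog in place, tables registered in the metastore become queryable as `hive.<schema>.<table>` from any Trino client.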
Service Orchestration:
The docker-compose.yml for the notebook environment defines five services with explicit dependency ordering:
```
spark-hudi (depends_on: hive-metastore, minio)
 |
 +-- hive-metastore (standalone)
 |
 +-- minio (standalone)
 |    |
 |    +-- mc (depends_on: minio, initializes buckets)
 |
trino (depends_on: hive-metastore, minio)
```
The mc (MinIO Client) container runs an initialization entrypoint that creates the warehouse bucket with a public access policy, ensuring that Spark and Trino can read and write data without additional credential configuration.
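A bucket-initializing `mc` service is commonly written as a short entrypoint like the one below. This is a hedged sketch, not the environment's actual compose file: the alias, bucket name, and credentials are assumptions.

```yaml
# Hypothetical mc init service; image tag, alias, and bucket name are assumptions.
mc:
  image: minio/mc
  depends_on:
    - minio
  entrypoint: >
    /bin/sh -c "
    mc alias set local http://minio:9000 admin password &&
    mc mb --ignore-existing local/warehouse &&
    mc anonymous set public local/warehouse
    "
```

The container runs once, creates the bucket idempotently (`--ignore-existing`), opens anonymous access, and exits; Spark and Trino can then write to `s3a://warehouse/...` without per-service credentials.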
Port Mapping Strategy:
The environment exposes well-known ports for each service, allowing users to access web UIs directly from the host:
- 8888 -- JupyterLab (primary interaction point)
- 4040 -- Spark Application UI (job monitoring)
- 9000/9001 -- MinIO S3 API and Console
- 8085 -- Trino Web UI (mapped from container port 8080 to avoid conflict with Spark Master)
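In Docker Compose terms, the mappings above amount to `host:container` port pairs along these lines (service names are assumptions based on the dependency diagram, not the actual file):

```yaml
# Illustrative port mappings (host:container)
spark-hudi:
  ports:
    - "8888:8888"   # JupyterLab
    - "4040:4040"   # Spark Application UI
minio:
  ports:
    - "9000:9000"   # S3 API
    - "9001:9001"   # Console
trino:
  ports:
    - "8085:8080"   # remapped on the host to avoid the Spark Master's 8080
```

Only the Trino mapping is asymmetric: the container still listens on its default 8080 internally, so inter-container traffic is unaffected by the host-side remap.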