Environment:DataExpert_io_Data_engineer_handbook_Spark_Iceberg_Docker_Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Big_Data |
| Last Updated | 2026-02-09 06:00 GMT |
Overview
Docker Compose environment with Apache Spark, Iceberg catalog, MinIO object storage, and Jupyter notebooks for PySpark development.
Description
This environment provides a multi-container stack for Apache Spark with Iceberg table format support. It includes a Spark-Iceberg runtime with Jupyter notebook access, an Iceberg REST catalog service, MinIO for S3-compatible object storage, and a MinIO client for bucket initialization. The stack is connected via a dedicated Docker network (`iceberg_net`) and pre-configured with AWS-compatible credentials for the MinIO storage layer.
Usage
Use this environment for any PySpark or Iceberg workflow that requires a local lakehouse development environment. It is the mandatory prerequisite for running the SparkSession_Builder, Do_player_scd_transformation, Do_monthly_user_site_hits_transformation, Do_team_vertex_transformation, and DataFrame_Write_InsertInto implementations.
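The Jupyter image in this stack ships pre-configured, but when building a session by hand (as the SparkSession_Builder implementation does) roughly these Iceberg catalog properties are involved. This is a sketch, not the stack's exact defaults: the catalog name `demo` is an assumption, and the hostnames are the compose service names on `iceberg_net`.

```python
# Sketch of the Spark properties an Iceberg REST + MinIO session needs.
# Hostnames ("rest", "minio") are the compose service names; the catalog
# name "demo" is an assumption, not mandated by the stack.
def iceberg_spark_conf(catalog: str = "demo") -> dict:
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": "http://rest:8181",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": "s3://warehouse/",
        f"{prefix}.s3.endpoint": "http://minio:9000",
    }

# Apply via SparkSession.builder.config(k, v) for each key/value pair.
```

Each pair maps directly onto a `--conf` flag or a `SparkSession.builder.config()` call, so the same dict works for `spark-submit` and notebook sessions alike.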
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux, macOS, or Windows with WSL2 | Docker Desktop required |
| Software | Docker Engine + Docker Compose | v2.x recommended |
| Memory | Minimum 4GB allocated to Docker | Spark requires significant heap space |
| Disk | ~2GB | Container images + warehouse data |
| Network | Ports 8888, 8080, 8181, 9000, 9001, 10000, 10001, 4040-4042 | Must be available on host |
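Because the stack claims so many host ports, a quick pre-flight check saves a failed `docker compose up`. A minimal stdlib sketch; the port list is taken from the table above, and `busy_ports` simply reports which ones something is already listening on:

```python
import socket

# Ports the stack binds on the host, per the System Requirements table.
REQUIRED_PORTS = [8888, 8080, 8181, 9000, 9001, 10000, 10001, 4040, 4041, 4042]

def busy_ports(ports, host="127.0.0.1"):
    """Return the subset of ports that already have a listener."""
    busy = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(0.2)
            # connect_ex returns 0 when the connection succeeds,
            # i.e. something is already bound to that port.
            if s.connect_ex((host, port)) == 0:
                busy.append(port)
    return busy
```

Running `busy_ports(REQUIRED_PORTS)` before starting the stack shows exactly which ports need to be freed.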
Dependencies
System Packages
- Docker Engine with Docker Compose v2
- Apache Spark (via `tabulario/spark-iceberg` image)
- Iceberg REST Catalog (via `tabulario/iceberg-rest` image)
- MinIO (via `minio/minio` image)
Container Images
- `tabulario/spark-iceberg` — Spark with Iceberg integration and Jupyter
- `tabulario/iceberg-rest` — Iceberg REST catalog server
- `minio/minio` — S3-compatible object storage
- `minio/mc` — MinIO client for bucket initialization
Python Packages (for local testing)
- `pyspark` — install as `pyspark[sql]` to pull in the SQL extras
- `chispa` — DataFrame equality assertions for testing
- `pytest` — Test runner
Credentials
The following credentials are pre-configured in the Docker Compose stack:
- `AWS_ACCESS_KEY_ID`: MinIO access key (set to `admin` in compose)
- `AWS_SECRET_ACCESS_KEY`: MinIO secret key (set to `password` in compose)
- `AWS_REGION`: AWS region for S3 compatibility (set to `us-east-1`)
- `MINIO_ROOT_USER`: MinIO admin user (set to `admin`)
- `MINIO_ROOT_PASSWORD`: MinIO admin password (set to `password`)
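For local (non-container) test runs, the same credentials can be exported before creating a Spark or S3 client. A small sketch mirroring the values above; the helper name `apply_minio_env` is hypothetical:

```python
import os

# Credentials from the Docker Compose stack, as listed above.
# These are development defaults only -- do not reuse outside this stack.
MINIO_ENV = {
    "AWS_ACCESS_KEY_ID": "admin",
    "AWS_SECRET_ACCESS_KEY": "password",
    "AWS_REGION": "us-east-1",
}

def apply_minio_env(env=MINIO_ENV):
    """Export the stack's credentials into the current process environment."""
    os.environ.update(env)
    return {k: os.environ[k] for k in env}
```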
Quick Install
# Start the full Spark-Iceberg-MinIO stack
docker compose up -d
# Wait for all services to be healthy
docker compose ps
# Access Jupyter notebooks at http://localhost:8888
# For local test development
pip install pyspark chispa pytest
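After `docker compose up -d`, the services take a moment to come up. A stdlib sketch for checking readiness from the host; the URL map assumes the published ports above, and the `/v1/config` path is the Iceberg REST catalog's config endpoint:

```python
import urllib.request
import urllib.error

# Host-side endpoints, assuming the port mappings from the compose file.
SERVICES = {
    "jupyter": "http://localhost:8888",
    "spark-ui": "http://localhost:8080",
    "iceberg-rest": "http://localhost:8181/v1/config",
    "minio-console": "http://localhost:9001",
}

def is_up(url, timeout=2.0):
    """True if the endpoint answers with any HTTP status (even 4xx)."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # server responded, just not 2xx
    except (urllib.error.URLError, OSError):
        return False
```

Polling `is_up` for each entry in `SERVICES` until all return `True` is a simple stand-in for `docker compose ps` health checks.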
Code Evidence
Spark-Iceberg service definition from `docker-compose.yaml:5-22`:
spark-iceberg:
  image: tabulario/spark-iceberg
  container_name: spark-iceberg
  build: spark/
  depends_on:
    - rest
    - minio
  volumes:
    - ./warehouse:/home/iceberg/warehouse
    - ./notebooks:/home/iceberg/notebooks/notebooks
    - ./data:/home/iceberg/data
  environment:
    - AWS_ACCESS_KEY_ID=admin
    - AWS_SECRET_ACCESS_KEY=password
    - AWS_REGION=us-east-1
  ports:
    - 8888:8888
    - 8080:8080
    - 10000:10000
    - 10001:10001
MinIO bucket initialization from `docker-compose.yaml:52-62`:
/bin/sh -c "
  until (/usr/bin/mc alias set minio http://minio:9000/ admin password) do
    echo '...waiting...' && sleep 1;
  done;
  /usr/bin/mc mb minio/warehouse || echo 'Warehouse already exists';
  /usr/bin/mc policy set public minio/warehouse;
  tail -f /dev/null
"
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `OutOfMemoryError: Java heap space` | Insufficient memory for Spark | Increase Docker memory allocation to at least 4GB |
| `Connection refused` on port 8888 | Spark-Iceberg container not ready | Wait for all containers to start; check `docker compose logs spark-iceberg` |
| `Warehouse already exists` in `mc` logs | MinIO bucket initialization re-runs on restart | Informational, not an error; the `mc` init script prints this when `mc mb` finds the bucket already present |
Compatibility Notes
- Port Conflicts: The stack uses many ports (8888, 8080, 8181, 9000, 9001, 10000-10001, 4040-4042). Ensure no other services occupy these ports.
- Spark UI: Available at `http://localhost:4040` during active Spark sessions; additional concurrent sessions fall back to 4041-4042.
- Data Persistence: Warehouse data is stored in `./warehouse` on the host filesystem. MinIO data does not persist by default between container restarts unless a volume is added.
- Network: All services communicate via the `iceberg_net` Docker network.
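The Data Persistence note above can be addressed with a named volume. A sketch, assuming the compose file's existing `minio` service stores data under `/data` (its default when launched with `server /data`); the volume name `minio_data` is hypothetical:

```yaml
# Additions only -- merge into the existing compose file.
services:
  minio:
    volumes:
      - minio_data:/data   # persist object storage across restarts

volumes:
  minio_data:
```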
Related Pages
- Implementation:DataExpert_io_Data_engineer_handbook_SparkSession_Builder
- Implementation:DataExpert_io_Data_engineer_handbook_Do_player_scd_transformation
- Implementation:DataExpert_io_Data_engineer_handbook_Do_monthly_user_site_hits_transformation
- Implementation:DataExpert_io_Data_engineer_handbook_Do_team_vertex_transformation
- Implementation:DataExpert_io_Data_engineer_handbook_DataFrame_Write_InsertInto