
Environment: DataExpert.io Data Engineer Handbook Spark Iceberg Docker Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Big_Data
Last Updated: 2026-02-09 06:00 GMT

Overview

Docker Compose environment with Apache Spark, Iceberg catalog, MinIO object storage, and Jupyter notebooks for PySpark development.

Description

This environment provides a multi-container stack for Apache Spark with Iceberg table format support. It includes a Spark-Iceberg runtime with Jupyter notebook access, an Iceberg REST catalog service, MinIO for S3-compatible object storage, and a MinIO client for bucket initialization. The stack is connected via a dedicated Docker network (`iceberg_net`) and pre-configured with AWS-compatible credentials for the MinIO storage layer.

Usage

Use this environment for any PySpark or Iceberg workflow that requires a local lakehouse development environment. It is the mandatory prerequisite for running the SparkSession_Builder, Do_player_scd_transformation, Do_monthly_user_site_hits_transformation, Do_team_vertex_transformation, and DataFrame_Write_InsertInto implementations.
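A SparkSession pointed at this stack needs a handful of Iceberg catalog properties. The sketch below builds them as a plain dict; the property keys are standard Iceberg Spark runtime settings, while the defaults (`rest`, `minio` service hostnames and a catalog named `demo`) are assumptions based on this compose stack and may differ in your file:

```python
def iceberg_spark_conf(
    catalog: str = "demo",
    rest_uri: str = "http://rest:8181",
    s3_endpoint: str = "http://minio:9000",
    warehouse: str = "s3://warehouse/",
) -> dict:
    """Spark properties wiring an Iceberg REST catalog backed by MinIO.

    Defaults mirror the service names inside the `iceberg_net` network;
    swap in localhost ports when connecting from the host.
    """
    return {
        # Load the Iceberg SQL extensions.
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        # Register a catalog served by the REST catalog container.
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.type": "rest",
        f"spark.sql.catalog.{catalog}.uri": rest_uri,
        # Route table data through Iceberg's S3 FileIO to MinIO.
        f"spark.sql.catalog.{catalog}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"spark.sql.catalog.{catalog}.s3.endpoint": s3_endpoint,
        f"spark.sql.catalog.{catalog}.warehouse": warehouse,
    }
```

Each key/value pair can then be applied via `SparkSession.builder.config(k, v)` before `getOrCreate()`.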

System Requirements

  • OS: Linux, macOS, or Windows with WSL2 (Docker Desktop required)
  • Software: Docker Engine with Docker Compose (v2.x recommended)
  • Memory: minimum 4GB allocated to Docker (Spark requires significant heap space)
  • Disk: ~2GB for container images plus warehouse data
  • Network: ports 8888, 8080, 8181, 9000, 9001, 10000, 10001, and 4040-4042 must be available on the host
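The port requirement can be checked before starting the stack with a few lines of stdlib Python; `ports_in_use` is an illustrative helper, not part of the environment:

```python
import socket

# Host ports the compose stack expects to bind.
STACK_PORTS = [8888, 8080, 8181, 9000, 9001, 10000, 10001, 4040, 4041, 4042]

def ports_in_use(ports=STACK_PORTS, host="127.0.0.1"):
    """Return the subset of `ports` that something is already listening on."""
    busy = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(0.2)
            # connect_ex returns 0 when a listener accepts the connection.
            if s.connect_ex((host, port)) == 0:
                busy.append(port)
    return busy
```

An empty result means the stack can claim every port it needs.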

Dependencies

System Packages

  • Docker Engine with Docker Compose v2
  • Apache Spark (via `tabulario/spark-iceberg` image)
  • Iceberg REST Catalog (via `tabulario/iceberg-rest` image)
  • MinIO (via `minio/minio` image)

Container Images

  • `tabulario/spark-iceberg` — Spark with Iceberg integration and Jupyter
  • `tabulario/iceberg-rest` — Iceberg REST catalog server
  • `minio/minio` — S3-compatible object storage
  • `minio/mc` — MinIO client for bucket initialization

Python Packages (for local testing)

  • `pyspark` (with `pyspark[sql]`)
  • `chispa` — DataFrame equality assertions for testing
  • `pytest` — Test runner

Credentials

The following credentials are pre-configured in the Docker Compose stack:

  • `AWS_ACCESS_KEY_ID`: MinIO access key (set to `admin` in compose)
  • `AWS_SECRET_ACCESS_KEY`: MinIO secret key (set to `password` in compose)
  • `AWS_REGION`: AWS region for S3 compatibility (set to `us-east-1`)
  • `MINIO_ROOT_USER`: MinIO admin user (set to `admin`)
  • `MINIO_ROOT_PASSWORD`: MinIO admin password (set to `password`)
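When testing locally, the same credentials can be fed to Spark's S3A connector. A sketch, assuming the stack's defaults; the property keys are standard Hadoop S3A settings, and `minio_s3a_conf` is an illustrative helper:

```python
import os

def minio_s3a_conf(endpoint: str = "http://minio:9000") -> dict:
    """Hadoop S3A properties matching the stack's pre-configured credentials.

    Reads the same environment variables the compose file injects,
    falling back to the defaults listed above.
    """
    return {
        "spark.hadoop.fs.s3a.access.key": os.environ.get("AWS_ACCESS_KEY_ID", "admin"),
        "spark.hadoop.fs.s3a.secret.key": os.environ.get("AWS_SECRET_ACCESS_KEY", "password"),
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        # MinIO serves buckets at path-style URLs, not virtual-hosted ones.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }
```

Path-style access is the important non-default here: without it, S3A tries `bucket.minio:9000` hostnames that Docker's DNS cannot resolve.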

Quick Install

# Start the full Spark-Iceberg-MinIO stack
docker compose up -d

# Wait for all services to be healthy
docker compose ps

# Access Jupyter notebooks at http://localhost:8888

# For local test development
pip install pyspark chispa pytest
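Rather than re-running `docker compose ps` by hand, readiness can be probed over HTTP with the stdlib. `wait_for` is an illustrative helper; the URLs in the usage note assume the host-port mappings listed above:

```python
import time
import urllib.error
import urllib.request

def wait_for(url: str, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll `url` until the service answers or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered with an error status -- it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Not listening yet; retry after a short pause.
            time.sleep(interval)
    return False
```

For example, `wait_for("http://localhost:8888")` covers Jupyter, and `wait_for("http://localhost:8181/v1/config")` exercises the Iceberg REST catalog's config endpoint.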

Code Evidence

Spark-Iceberg service definition from `docker-compose.yaml:5-22`:

  spark-iceberg:
    image: tabulario/spark-iceberg
    container_name: spark-iceberg
    build: spark/
    depends_on:
      - rest
      - minio
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
      - ./data:/home/iceberg/data
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - 8888:8888
      - 8080:8080
      - 10000:10000
      - 10001:10001

MinIO bucket initialization from `docker-compose.yaml:52-62`:

/bin/sh -c "
until (/usr/bin/mc alias set minio http://minio:9000/ admin password) do
  echo '...waiting...' && sleep 1;
done;
/usr/bin/mc mb minio/warehouse || echo 'Warehouse already exists';
/usr/bin/mc policy set public minio/warehouse;
tail -f /dev/null
"

Common Errors

  • `OutOfMemoryError: Java heap space`: insufficient memory for Spark. Increase the Docker memory allocation to at least 4GB.
  • `Connection refused` on port 8888: the Spark-Iceberg container is not ready. Wait for all containers to start and check `docker compose logs spark-iceberg`.
  • `Bucket already exists`: MinIO bucket initialization on restart. Informational, not an error; the `mc` container handles this gracefully.

Compatibility Notes

  • Port Conflicts: The stack uses many ports (8888, 8080, 8181, 9000, 9001, 10000-10001, 4040-4042). Ensure no other services occupy these ports.
  • Spark UI: Available at `http://localhost:4040` during active Spark sessions.
  • Data Persistence: Warehouse data is stored in `./warehouse` on the host filesystem. MinIO data does not persist by default between container restarts unless a volume is added.
  • Network: All services communicate via the `iceberg_net` Docker network.
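The persistence gap noted above can be closed with a bind mount on the MinIO service. A sketch in the stack's own compose style; the `./minio-data` host path is illustrative, and `/data` is the directory MinIO serves in this stack:

```yaml
  minio:
    image: minio/minio
    volumes:
      # Hypothetical host path; keeps MinIO's buckets across container restarts
      - ./minio-data:/data
```

With this in place, the `mc mb` initialization step simply logs `Warehouse already exists` on subsequent starts.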
