
Environment: DataExpert.io Data Engineer Handbook Spark Iceberg Docker Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Big_Data
Last Updated: 2026-02-09 06:00 GMT

Overview

Docker Compose environment with Apache Spark, Iceberg catalog, MinIO object storage, and Jupyter notebooks for PySpark development.

Description

This environment provides a multi-container stack for Apache Spark with Iceberg table format support. It includes a Spark-Iceberg runtime with Jupyter notebook access, an Iceberg REST catalog service, MinIO for S3-compatible object storage, and a MinIO client for bucket initialization. The stack is connected via a dedicated Docker network (`iceberg_net`) and pre-configured with AWS-compatible credentials for the MinIO storage layer.

Usage

Use this environment for any PySpark or Iceberg workflow that requires a local lakehouse development environment. It is the mandatory prerequisite for running the SparkSession_Builder, Do_player_scd_transformation, Do_monthly_user_site_hits_transformation, Do_team_vertex_transformation, and DataFrame_Write_InsertInto implementations.
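A SparkSession pointed at this stack needs a handful of Iceberg catalog properties. The sketch below builds them as a plain dict; the property keys are standard Iceberg Spark runtime settings, while the defaults (`rest`, `minio` service hostnames and a catalog named `demo`) are assumptions based on this compose stack and may differ in your file:

```python
def iceberg_spark_conf(
    catalog: str = "demo",
    rest_uri: str = "http://rest:8181",
    s3_endpoint: str = "http://minio:9000",
    warehouse: str = "s3://warehouse/",
) -> dict:
    """Spark properties wiring an Iceberg REST catalog backed by MinIO.

    Defaults mirror the service names inside the `iceberg_net` network;
    swap in localhost ports when connecting from the host.
    """
    return {
        # Load the Iceberg SQL extensions.
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        # Register a catalog served by the REST catalog container.
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.type": "rest",
        f"spark.sql.catalog.{catalog}.uri": rest_uri,
        # Route table data through Iceberg's S3 FileIO to MinIO.
        f"spark.sql.catalog.{catalog}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"spark.sql.catalog.{catalog}.s3.endpoint": s3_endpoint,
        f"spark.sql.catalog.{catalog}.warehouse": warehouse,
    }
```

Each key/value pair can then be applied via `SparkSession.builder.config(k, v)` before `getOrCreate()`.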

System Requirements

  • OS: Linux, macOS, or Windows with WSL2 (Docker Desktop required)
  • Software: Docker Engine with Docker Compose (v2.x recommended)
  • Memory: minimum 4GB allocated to Docker (Spark requires significant heap space)
  • Disk: ~2GB for container images plus warehouse data
  • Network: ports 8888, 8080, 8181, 9000, 9001, 10000, 10001, and 4040-4042 must be available on the host
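The port requirement can be checked before starting the stack with a few lines of stdlib Python; `ports_in_use` is an illustrative helper, not part of the environment:

```python
import socket

# Host ports the compose stack expects to bind.
STACK_PORTS = [8888, 8080, 8181, 9000, 9001, 10000, 10001, 4040, 4041, 4042]

def ports_in_use(ports=STACK_PORTS, host="127.0.0.1"):
    """Return the subset of `ports` that something is already listening on."""
    busy = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(0.2)
            # connect_ex returns 0 when a listener accepts the connection.
            if s.connect_ex((host, port)) == 0:
                busy.append(port)
    return busy
```

An empty result means the stack can claim every port it needs.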

Dependencies

System Packages

  • Docker Engine with Docker Compose v2
  • Apache Spark (via `tabulario/spark-iceberg` image)
  • Iceberg REST Catalog (via `tabulario/iceberg-rest` image)
  • MinIO (via `minio/minio` image)

Container Images

  • `tabulario/spark-iceberg` — Spark with Iceberg integration and Jupyter
  • `tabulario/iceberg-rest` — Iceberg REST catalog server
  • `minio/minio` — S3-compatible object storage
  • `minio/mc` — MinIO client for bucket initialization

Python Packages (for local testing)

  • `pyspark` (with `pyspark[sql]`)
  • `chispa` — DataFrame equality assertions for testing
  • `pytest` — Test runner

Credentials

The following credentials are pre-configured in the Docker Compose stack:

  • `AWS_ACCESS_KEY_ID`: MinIO access key (set to `admin` in compose)
  • `AWS_SECRET_ACCESS_KEY`: MinIO secret key (set to `password` in compose)
  • `AWS_REGION`: AWS region for S3 compatibility (set to `us-east-1`)
  • `MINIO_ROOT_USER`: MinIO admin user (set to `admin`)
  • `MINIO_ROOT_PASSWORD`: MinIO admin password (set to `password`)
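When testing locally, the same credentials can be fed to Spark's S3A connector. A sketch, assuming the stack's defaults; the property keys are standard Hadoop S3A settings, and `minio_s3a_conf` is an illustrative helper:

```python
import os

def minio_s3a_conf(endpoint: str = "http://minio:9000") -> dict:
    """Hadoop S3A properties matching the stack's pre-configured credentials.

    Reads the same environment variables the compose file injects,
    falling back to the defaults listed above.
    """
    return {
        "spark.hadoop.fs.s3a.access.key": os.environ.get("AWS_ACCESS_KEY_ID", "admin"),
        "spark.hadoop.fs.s3a.secret.key": os.environ.get("AWS_SECRET_ACCESS_KEY", "password"),
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        # MinIO serves buckets at path-style URLs, not virtual-hosted ones.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }
```

Path-style access is the important non-default here: without it, S3A tries `bucket.minio:9000` hostnames that Docker's DNS cannot resolve.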

Quick Install

# Start the full Spark-Iceberg-MinIO stack
docker compose up -d

# Wait for all services to be healthy
docker compose ps

# Access Jupyter notebooks at http://localhost:8888

# For local test development
pip install pyspark chispa pytest
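Rather than re-running `docker compose ps` by hand, readiness can be probed over HTTP with the stdlib. `wait_for` is an illustrative helper; the URLs in the usage note assume the host-port mappings listed above:

```python
import time
import urllib.error
import urllib.request

def wait_for(url: str, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll `url` until the service answers or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered with an error status -- it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Not listening yet; retry after a short pause.
            time.sleep(interval)
    return False
```

For example, `wait_for("http://localhost:8888")` covers Jupyter, and `wait_for("http://localhost:8181/v1/config")` exercises the Iceberg REST catalog's config endpoint.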

Code Evidence

Spark-Iceberg service definition from `docker-compose.yaml:5-22`:

  spark-iceberg:
    image: tabulario/spark-iceberg
    container_name: spark-iceberg
    build: spark/
    depends_on:
      - rest
      - minio
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
      - ./data:/home/iceberg/data
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - 8888:8888
      - 8080:8080
      - 10000:10000
      - 10001:10001

MinIO bucket initialization from `docker-compose.yaml:52-62`:

/bin/sh -c "
until (/usr/bin/mc alias set minio http://minio:9000/ admin password) do
  echo '...waiting...' && sleep 1;
done;
/usr/bin/mc mb minio/warehouse || echo 'Warehouse already exists';
/usr/bin/mc policy set public minio/warehouse;
tail -f /dev/null
"

Common Errors

  • `OutOfMemoryError: Java heap space`: insufficient memory for Spark. Increase the Docker memory allocation to at least 4GB.
  • `Connection refused` on port 8888: the Spark-Iceberg container is not ready. Wait for all containers to start and check `docker compose logs spark-iceberg`.
  • `Bucket already exists`: MinIO bucket initialization on restart. Informational, not an error; the `mc` container handles this gracefully.

Compatibility Notes

  • Port Conflicts: The stack uses many ports (8888, 8080, 8181, 9000, 9001, 10000-10001, 4040-4042). Ensure no other services occupy these ports.
  • Spark UI: Available at `http://localhost:4040` during active Spark sessions.
  • Data Persistence: Warehouse data is stored in `./warehouse` on the host filesystem. MinIO data does not persist by default between container restarts unless a volume is added.
  • Network: All services communicate via the `iceberg_net` Docker network.
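The persistence gap noted above can be closed with a bind mount on the MinIO service. A sketch in the stack's own compose style; the `./minio-data` host path is illustrative, and `/data` is the directory MinIO serves in this stack:

```yaml
  minio:
    image: minio/minio
    volumes:
      # Hypothetical host path; keeps MinIO's buckets across container restarts
      - ./minio-data:/data
```

With this in place, the `mc mb` initialization step simply logs `Warehouse already exists` on subsequent starts.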
