Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:DataExpert io Data engineer handbook PySpark Iceberg Job Execution

From Leeroopedia
Revision as of 11:05, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/DataExpert_io_Data_engineer_handbook_PySpark_Iceberg_Job_Execution.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Data_Engineering, Apache_Spark, Apache_Iceberg, Docker
Last Updated 2026-02-09 06:00 GMT

Overview

End-to-end process for running PySpark transformation jobs against Apache Iceberg tables in a Dockerized Spark cluster environment.

Description

This workflow covers the execution of PySpark data transformation jobs as part of Week 3 (Spark Fundamentals) of the intermediate bootcamp. It provisions a multi-container environment including Apache Spark, Apache Iceberg (table format), and MinIO (S3-compatible object storage) via Docker Compose. The workflow demonstrates three core transformation patterns: Slowly Changing Dimension (SCD) Type 2 processing, monthly aggregation with array handling, and graph vertex generation. Each job follows a consistent pattern of reading from Iceberg tables, applying SQL-based transformations via temporary views, and writing results back to Iceberg.

Usage

Execute this workflow when you need to run distributed PySpark transformation jobs on structured data using the Iceberg table format. This is appropriate when you have completed the dimensional modeling exercises (Weeks 1-2) and are ready to work with distributed processing frameworks on the same data modeling concepts at scale.

Execution Steps

Step 1: Launch_Spark_Environment

Start the Docker Compose stack that provisions the Spark cluster with Iceberg support. The environment includes a Spark master, worker nodes, a Jupyter notebook server, an Iceberg REST catalog, and MinIO for S3-compatible object storage.

Key considerations:

  • The docker-compose.yaml defines services for spark-iceberg, minio, mc (MinIO client), and rest catalog
  • Jupyter notebook becomes accessible at localhost:8888
  • The Spark environment is pre-configured with Iceberg catalog extensions
  • MinIO provides the storage layer backing the Iceberg warehouse

Step 2: Access_Notebook_Interface

Connect to the Jupyter notebook server at localhost:8888 to interact with the Spark cluster. The first notebook to run is event_data_pyspark.ipynb inside the notebooks folder, which introduces the data structures and Spark operations used in the transformation jobs.

Key considerations:

  • Jupyter provides an interactive environment for exploring Spark DataFrames
  • The notebook establishes a SparkSession with Iceberg catalog configuration
  • Use this step to understand the data schema before running batch jobs

Step 3: Configure_Spark_Session

Each PySpark job initializes a SparkSession configured for local execution with the appropriate application name. The session connects to the Iceberg catalog and provides access to the underlying tables for reading and writing.

What happens:

  • SparkSession.builder creates a session with master("local") for development
  • Application name identifies the job in the Spark UI
  • The session provides access to Iceberg-managed tables via spark.table()

Step 4: Execute_Transformation_Jobs

Run the PySpark transformation jobs that implement core data engineering patterns. Three jobs are provided, each following the same architectural pattern: register a DataFrame as a temporary SQL view, execute a SQL transformation query, and write results to an output table.

Available transformations:

  • Player SCD Job: Implements SCD Type 2 change tracking using window functions (LAG, SUM with CASE) to identify streaks where a player's scoring class changes over seasons
  • Monthly Site Hits Job: Aggregates hit arrays by month using COALESCE and array element access (get function), partitioned by date
  • Team Vertex Job: Generates graph vertices from team data for graph-based analytics

Pattern:

  • Register input DataFrame as temporary view
  • Execute SQL transformation via spark.sql()
  • Return transformed DataFrame

Step 5: Write_Results

Persist the transformation results back to Iceberg tables using the DataFrame write API. Results are written in overwrite mode using insertInto, which respects the target table's schema and partitioning.

Key considerations:

  • write.mode("overwrite") replaces existing data in the target table
  • insertInto requires the target table to already exist with a matching schema
  • Iceberg handles schema evolution and partition management transparently

Execution Diagram

GitHub URL

Workflow Repository