Workflow:DataExpert io Data engineer handbook PySpark Iceberg Job Execution
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Apache_Spark, Apache_Iceberg, Docker |
| Last Updated | 2026-02-09 06:00 GMT |
Overview
End-to-end process for running PySpark transformation jobs against Apache Iceberg tables in a Dockerized Spark cluster environment.
Description
This workflow covers the execution of PySpark data transformation jobs as part of Week 3 (Spark Fundamentals) of the intermediate bootcamp. It provisions a multi-container environment including Apache Spark, Apache Iceberg (table format), and MinIO (S3-compatible object storage) via Docker Compose. The workflow demonstrates three core transformation patterns: Slowly Changing Dimension (SCD) Type 2 processing, monthly aggregation with array handling, and graph vertex generation. Each job follows a consistent pattern of reading from Iceberg tables, applying SQL-based transformations via temporary views, and writing results back to Iceberg.
Usage
Execute this workflow when you need to run distributed PySpark transformation jobs on structured data using the Iceberg table format. This is appropriate when you have completed the dimensional modeling exercises (Weeks 1-2) and are ready to work with distributed processing frameworks on the same data modeling concepts at scale.
Execution Steps
Step 1: Launch_Spark_Environment
Start the Docker Compose stack that provisions the Spark cluster with Iceberg support. The environment includes a Spark master, worker nodes, a Jupyter notebook server, an Iceberg REST catalog, and MinIO for S3-compatible object storage.
Key considerations:
- The docker-compose.yaml defines services for spark-iceberg, minio, mc (MinIO client), and rest catalog
- Jupyter notebook becomes accessible at localhost:8888
- The Spark environment is pre-configured with Iceberg catalog extensions
- MinIO provides the storage layer backing the Iceberg warehouse
Step 2: Access_Notebook_Interface
Connect to the Jupyter notebook server at localhost:8888 to interact with the Spark cluster. The first notebook to run is event_data_pyspark.ipynb inside the notebooks folder, which introduces the data structures and Spark operations used in the transformation jobs.
Key considerations:
- Jupyter provides an interactive environment for exploring Spark DataFrames
- The notebook establishes a SparkSession with Iceberg catalog configuration
- Use this step to understand the data schema before running batch jobs
Step 3: Configure_Spark_Session
Each PySpark job initializes a SparkSession configured for local execution with the appropriate application name. The session connects to the Iceberg catalog and provides access to the underlying tables for reading and writing.
What happens:
- SparkSession.builder creates a session with master("local") for development
- Application name identifies the job in the Spark UI
- The session provides access to Iceberg-managed tables via spark.table()
Step 4: Execute_Transformation_Jobs
Run the PySpark transformation jobs that implement core data engineering patterns. Three jobs are provided, each following the same architectural pattern: register a DataFrame as a temporary SQL view, execute a SQL transformation query, and write results to an output table.
Available transformations:
- Player SCD Job: Implements SCD Type 2 change tracking using window functions (LAG, SUM with CASE) to identify streaks where a player's scoring class changes over seasons
- Monthly Site Hits Job: Aggregates hit arrays by month using COALESCE and array element access (get function), partitioned by date
- Team Vertex Job: Generates graph vertices from team data for graph-based analytics
Pattern:
- Register input DataFrame as temporary view
- Execute SQL transformation via spark.sql()
- Return transformed DataFrame
Step 5: Write_Results
Persist the transformation results back to Iceberg tables using the DataFrame write API. Results are written in overwrite mode using insertInto, which respects the target table's schema and partitioning.
Key considerations:
- write.mode("overwrite") replaces existing data in the target table
- insertInto requires the target table to already exist with a matching schema
- Iceberg handles schema evolution and partition management transparently