
Environment:DataTalksClub Data Engineering Zoomcamp PySpark Batch Environment

From Leeroopedia


Knowledge Sources

  • Domains: Batch_Processing, Data_Transformation
  • Last Updated: 2026-02-09 07:00 GMT

06-batch/code/06_spark_sql.py

Overview

PySpark environment with Apache Spark, Java JDK, and Hadoop for batch processing of NYC taxi trip Parquet data.

Description

This environment provides the Apache Spark runtime for batch processing workflows. The PySpark script reads Parquet files, normalizes schemas across green and yellow taxi datasets, performs SQL-based revenue aggregations, and writes partitioned Parquet output. Spark requires a Java JDK and optionally Hadoop/YARN for cluster deployment. For local development, Spark runs in standalone mode.

Usage

Use this environment for any batch processing or large-scale data transformation workflow using Apache Spark. It is the mandatory prerequisite for running the SparkSession_Builder, Spark_Read_Parquet, Spark_WithColumnRenamed, Spark_UnionAll, Spark_SQL_Aggregation, and Spark_Write_Parquet implementations.

System Requirements

  • OS: Linux, macOS, or Windows (Linux recommended for production)
  • Java: JDK 8 or JDK 11, required by the Spark runtime (OpenJDK recommended)
  • RAM: 4 GB minimum, 8 GB recommended (Spark workers need memory for in-memory processing)
  • Disk: ~5 GB free (Spark installation, input data, and output Parquet files)
  • Network: internet access (for downloading the taxi data Parquet files)

Dependencies

System Packages

  • Java JDK 8 or 11 (`JAVA_HOME` must be set)
  • Apache Spark 3.x (`SPARK_HOME` must be set)
  • Hadoop (optional, for HDFS/YARN cluster mode)

Python Packages

  • `pyspark` (version matching installed Spark)

Environment Variables

  • `JAVA_HOME`: Path to JDK installation
  • `SPARK_HOME`: Path to Spark installation
  • `PATH`: Must include `$SPARK_HOME/bin`
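A quick pre-flight check for these variables can save a confusing launch failure later. The helper below is a hypothetical convenience, not part of the repository:

```python
import os

def missing_spark_env(env=os.environ):
    """Return the names of required Spark environment variables that are unset."""
    required = ['JAVA_HOME', 'SPARK_HOME']
    missing = [name for name in required if not env.get(name)]
    # PATH must also include $SPARK_HOME/bin so spark-submit resolves.
    spark_home = env.get('SPARK_HOME')
    if spark_home and os.path.join(spark_home, 'bin') not in env.get('PATH', '').split(os.pathsep):
        missing.append('PATH ($SPARK_HOME/bin)')
    return missing
```

Calling `missing_spark_env()` before building a SparkSession turns a cryptic launcher error into an explicit list of what is missing.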

Credentials

No credentials required for local standalone mode.

For the BigQuery variant (`06_spark_sql_big_query.py`):

  • GCP service account with BigQuery write permissions
  • `spark.hadoop.google.cloud.auth.service.account.json.keyfile`: Path to service account JSON
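As a hedged sketch of how that keyfile property might be wired in from code (the app name and keyfile path are placeholders, not values taken from `06_spark_sql_big_query.py`):

```python
from pyspark.sql import SparkSession

# Assumed configuration sketch: point Spark's Hadoop layer at a GCP
# service-account JSON keyfile. The path below is a placeholder.
spark = SparkSession.builder \
    .appName('taxi_revenue_bq') \
    .config('spark.hadoop.google.cloud.auth.service.account.enable', 'true') \
    .config('spark.hadoop.google.cloud.auth.service.account.json.keyfile',
            '/path/to/service-account.json') \
    .getOrCreate()
```

The BigQuery connector jars themselves are supplied separately (typically via `--jars` or `--packages` on `spark-submit`).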

Quick Install

# Install PySpark via pip
pip install pyspark

# Verify installation
spark-submit --version

# Download taxi data
cd 06-batch/code
bash download_data.sh

Code Evidence

Spark session initialization from `06_spark_sql.py:24-26`:

spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()

PySpark imports from `06_spark_sql.py:6-8`:

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

Coalesce output for single-file write from `06_spark_sql.py:106-107`:

df_result.coalesce(1) \
    .write.parquet(output, mode='overwrite')

Common Errors

  • `JAVA_HOME is not set`: the Java JDK is not installed or not configured. Install JDK 8/11 and set the `JAVA_HOME` environment variable.
  • `Py4JJavaError: ... ClassNotFoundException`: missing Spark jars for connectors. Ensure `SPARK_HOME` is set and the connector jars are on the classpath.
  • `java.lang.OutOfMemoryError: Java heap space`: insufficient driver/executor memory. Increase it with `--driver-memory 4g` or `--executor-memory 4g`.
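The memory settings can also be raised from code rather than via `spark-submit` flags; a sketch, with the caveat that `spark.driver.memory` only takes effect when set before the JVM starts, i.e. on a fresh session:

```python
from pyspark.sql import SparkSession

# Raise driver and executor memory at session-build time. These values
# are examples; tune them to your data volume and available RAM.
spark = SparkSession.builder \
    .appName('taxi_batch') \
    .config('spark.driver.memory', '4g') \
    .config('spark.executor.memory', '4g') \
    .getOrCreate()
```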

Compatibility Notes

  • Java version: Spark 3.x officially supports JDK 8 and JDK 11. JDK 17+ may work but is not officially supported for all Spark versions.
  • Platform setup guides: The repository includes platform-specific Spark setup instructions for Linux (`06-batch/setup/linux.md`), macOS (`06-batch/setup/macos.md`), and Windows (`06-batch/setup/windows.md`).
  • Cloud deployment: The BigQuery variant (`06_spark_sql_big_query.py`) requires additional GCP connector jars and service account credentials.
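Checking which JDK major version is installed takes a little parsing, because pre-JDK-9 banners report versions as `1.8.0_...` while newer ones report `11.0.2`-style strings. The helper below is an illustrative assumption, not repository code:

```python
import re

def java_major_version(version_line):
    """Parse the major version out of a `java -version` banner line.

    Pre-JDK-9 banners look like '1.8.0_292'; newer ones like '11.0.2'.
    Returns None if no version string is found.
    """
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_line)
    if not m:
        return None
    major = int(m.group(1))
    # Old-style '1.x' banners encode the real major version second.
    return int(m.group(2)) if major == 1 and m.group(2) else major

# Spark 3.x officially supports JDK 8 and 11.
SUPPORTED = {8, 11}
```

Feeding it the stderr of `java -version` lets a setup script fail fast with a clear message instead of a runtime incompatibility.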
