Environment:DataTalksClub Data engineering zoomcamp PySpark Batch Environment
| Knowledge Sources | |
|---|---|
| Domains | Batch_Processing, Data_Transformation |
| Last Updated | 2026-02-09 07:00 GMT |
Overview
PySpark environment with Apache Spark, Java JDK, and Hadoop for batch processing of NYC taxi trip Parquet data.
Description
This environment provides the Apache Spark runtime for batch-processing workflows. The PySpark script reads Parquet files, normalizes schemas across the green and yellow taxi datasets, performs SQL-based revenue aggregations, and writes partitioned Parquet output. Spark requires a Java JDK; Hadoop/YARN is needed only for cluster deployment. For local development, Spark runs in standalone mode.
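The schema-normalization step amounts to finding the columns the two datasets share (after the pickup/dropoff timestamp columns have been renamed to common names) so that a union is valid. A minimal sketch of that column-matching logic in plain Python — the column names here are illustrative subsets of the real schemas:

```python
# Hypothetical subsets of the green/yellow schemas after the
# pickup/dropoff timestamp columns were renamed to common names.
green_columns = ['VendorID', 'pickup_datetime', 'dropoff_datetime',
                 'trip_distance', 'ehail_fee', 'total_amount']
yellow_columns = ['VendorID', 'pickup_datetime', 'dropoff_datetime',
                  'trip_distance', 'total_amount']

def common_columns(first, second):
    """Columns present in both schemas, preserving the order of `first`."""
    second_set = set(second)
    return [col for col in first if col in second_set]

common = common_columns(green_columns, yellow_columns)
# Each DataFrame would then be projected onto `common` before the union,
# e.g. df_green.select(common).unionAll(df_yellow.select(common)).
```

Projecting both DataFrames onto the shared column list before the union avoids schema-mismatch errors on columns like `ehail_fee` that exist in only one dataset.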
Usage
Use this environment for any batch processing or large-scale data transformation workflow using Apache Spark. It is the mandatory prerequisite for running the SparkSession_Builder, Spark_Read_Parquet, Spark_WithColumnRenamed, Spark_UnionAll, Spark_SQL_Aggregation, and Spark_Write_Parquet implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux, macOS, or Windows | Linux recommended for production |
| Java | JDK 8 or JDK 11 | Required by Spark runtime; OpenJDK recommended |
| RAM | 4GB minimum (8GB recommended) | Spark workers need memory for in-memory processing |
| Disk | ~5GB free | For Spark installation, input data, and output Parquet files |
| Network | Internet access | For downloading taxi data Parquet files |
Dependencies
System Packages
- Java JDK 8 or 11 (`JAVA_HOME` must be set)
- Apache Spark 3.x (`SPARK_HOME` must be set)
- Hadoop (optional, for HDFS/YARN cluster mode)
Python Packages
- `pyspark` (version matching installed Spark)
Environment Variables
- `JAVA_HOME`: Path to JDK installation
- `SPARK_HOME`: Path to Spark installation
- `PATH`: Must include `$SPARK_HOME/bin`
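A typical shell setup for these variables might look like the following — the JDK and Spark paths are illustrative and must be adjusted to the actual installation locations:

```bash
# Illustrative paths -- substitute your actual JDK and Spark locations
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
export SPARK_HOME="${HOME}/spark/spark-3.3.2-bin-hadoop3"
export PATH="${SPARK_HOME}/bin:${PATH}"
```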
Credentials
No credentials required for local standalone mode.
For the BigQuery variant (`06_spark_sql_big_query.py`):
- GCP service account with BigQuery write permissions
- `spark.hadoop.google.cloud.auth.service.account.json.keyfile`: Path to service account JSON
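A hedged sketch of how the keyfile property above might be wired into the session builder — the app name, keyfile path, and the companion `service.account.enable` flag are assumptions based on the standard Hadoop GCS connector configuration, not taken from the script itself:

```python
from pyspark.sql import SparkSession

# Paths and app name are illustrative; the json.keyfile property
# is the one listed in the Credentials section above.
spark = SparkSession.builder \
    .appName('spark-bigquery') \
    .config('spark.hadoop.google.cloud.auth.service.account.enable', 'true') \
    .config('spark.hadoop.google.cloud.auth.service.account.json.keyfile',
            '/path/to/service-account.json') \
    .getOrCreate()
```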
Quick Install
```bash
# Install PySpark via pip
pip install pyspark

# Verify installation
spark-submit --version

# Download taxi data
cd 06-batch/code
bash download_data.sh
```
Code Evidence
Spark session initialization from `06_spark_sql.py:24-26`:
```python
spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()
```
PySpark imports from `06_spark_sql.py:6-8`:
```python
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
```
Coalesce output for single-file write from `06_spark_sql.py:106-107`:
```python
df_result.coalesce(1) \
    .write.parquet(output, mode='overwrite')
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `JAVA_HOME is not set` | Java JDK not installed or not configured | Install JDK 8/11 and set `JAVA_HOME` environment variable |
| `Py4JJavaError: ... ClassNotFoundException` | Missing Spark jars for connectors | Ensure `SPARK_HOME` is set and connector jars are in the classpath |
| `java.lang.OutOfMemoryError: Java heap space` | Insufficient driver/executor memory | Increase with `--driver-memory 4g` or `--executor-memory 4g` |
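For example, the `OutOfMemoryError` fix from the table above would typically be applied at submit time — the memory sizes are illustrative and should be chosen based on the host's available RAM:

```bash
# Raise driver and executor heap sizes before re-running the job
spark-submit \
  --driver-memory 4g \
  --executor-memory 4g \
  06_spark_sql.py
```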
Compatibility Notes
- Java version: Spark 3.x officially supports JDK 8 and JDK 11. JDK 17+ may work but is not officially supported for all Spark versions.
- Platform setup guides: The repository includes platform-specific Spark setup instructions for Linux (`06-batch/setup/linux.md`), macOS (`06-batch/setup/macos.md`), and Windows (`06-batch/setup/windows.md`).
- Cloud deployment: The BigQuery variant (`06_spark_sql_big_query.py`) requires additional GCP connector jars and service account credentials.
Related Pages
- Implementation:DataTalksClub_Data_engineering_zoomcamp_SparkSession_Builder
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Spark_Read_Parquet
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Spark_WithColumnRenamed
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Spark_UnionAll
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Spark_SQL_Aggregation
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Spark_Write_Parquet