
Environment:DataTalksClub Data Engineering Zoomcamp PySpark Batch Environment

From Leeroopedia


Knowledge Sources

  • Domains: Batch_Processing, Data_Transformation
  • Last Updated: 2026-02-09 07:00 GMT

06-batch/code/06_spark_sql.py

Overview

PySpark environment with Apache Spark, Java JDK, and Hadoop for batch processing of NYC taxi trip Parquet data.

Description

This environment provides the Apache Spark runtime for batch processing workflows. The PySpark script reads Parquet files, normalizes schemas across green and yellow taxi datasets, performs SQL-based revenue aggregations, and writes partitioned Parquet output. Spark requires a Java JDK and optionally Hadoop/YARN for cluster deployment. For local development, Spark runs in standalone mode.

Usage

Use this environment for any batch processing or large-scale data transformation workflow using Apache Spark. It is the mandatory prerequisite for running the SparkSession_Builder, Spark_Read_Parquet, Spark_WithColumnRenamed, Spark_UnionAll, Spark_SQL_Aggregation, and Spark_Write_Parquet implementations.

System Requirements

  • OS: Linux, macOS, or Windows (Linux recommended for production)
  • Java: JDK 8 or JDK 11, required by the Spark runtime (OpenJDK recommended)
  • RAM: 4 GB minimum, 8 GB recommended (Spark workers need memory for in-memory processing)
  • Disk: ~5 GB free (Spark installation, input data, and output Parquet files)
  • Network: internet access (for downloading the taxi data Parquet files)

Dependencies

System Packages

  • Java JDK 8 or 11 (`JAVA_HOME` must be set)
  • Apache Spark 3.x (`SPARK_HOME` must be set)
  • Hadoop (optional, for HDFS/YARN cluster mode)

Python Packages

  • `pyspark` (version matching installed Spark)

Environment Variables

  • `JAVA_HOME`: Path to JDK installation
  • `SPARK_HOME`: Path to Spark installation
  • `PATH`: Must include `$SPARK_HOME/bin`
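A quick pre-flight check for these variables can save a confusing launch failure later. The helper below is a hypothetical convenience, not part of the repository:

```python
import os

def missing_spark_env(env=os.environ):
    """Return the names of required Spark environment variables that are unset."""
    required = ['JAVA_HOME', 'SPARK_HOME']
    missing = [name for name in required if not env.get(name)]
    # PATH must also include $SPARK_HOME/bin so spark-submit resolves.
    spark_home = env.get('SPARK_HOME')
    if spark_home and os.path.join(spark_home, 'bin') not in env.get('PATH', '').split(os.pathsep):
        missing.append('PATH ($SPARK_HOME/bin)')
    return missing
```

Calling `missing_spark_env()` before building a SparkSession turns a cryptic launcher error into an explicit list of what is missing.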

Credentials

No credentials required for local standalone mode.

For the BigQuery variant (`06_spark_sql_big_query.py`):

  • GCP service account with BigQuery write permissions
  • `spark.hadoop.google.cloud.auth.service.account.json.keyfile`: Path to service account JSON
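As a hedged sketch of how that keyfile property might be wired in from code (the app name and keyfile path are placeholders, not values taken from `06_spark_sql_big_query.py`):

```python
from pyspark.sql import SparkSession

# Assumed configuration sketch: point Spark's Hadoop layer at a GCP
# service-account JSON keyfile. The path below is a placeholder.
spark = SparkSession.builder \
    .appName('taxi_revenue_bq') \
    .config('spark.hadoop.google.cloud.auth.service.account.enable', 'true') \
    .config('spark.hadoop.google.cloud.auth.service.account.json.keyfile',
            '/path/to/service-account.json') \
    .getOrCreate()
```

The BigQuery connector jars themselves are supplied separately (typically via `--jars` or `--packages` on `spark-submit`).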

Quick Install

# Install PySpark via pip
pip install pyspark

# Verify installation
spark-submit --version

# Download taxi data
cd 06-batch/code
bash download_data.sh

Code Evidence

Spark session initialization from `06_spark_sql.py:24-26`:

spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()

PySpark imports from `06_spark_sql.py:6-8`:

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

Coalesce output for single-file write from `06_spark_sql.py:106-107`:

df_result.coalesce(1) \
    .write.parquet(output, mode='overwrite')

Common Errors

  • `JAVA_HOME is not set`: the Java JDK is not installed or not configured. Install JDK 8/11 and set the `JAVA_HOME` environment variable.
  • `Py4JJavaError: ... ClassNotFoundException`: missing Spark jars for connectors. Ensure `SPARK_HOME` is set and the connector jars are on the classpath.
  • `java.lang.OutOfMemoryError: Java heap space`: insufficient driver/executor memory. Increase it with `--driver-memory 4g` or `--executor-memory 4g`.
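The memory settings can also be raised from code rather than via `spark-submit` flags; a sketch, with the caveat that `spark.driver.memory` only takes effect when set before the JVM starts, i.e. on a fresh session:

```python
from pyspark.sql import SparkSession

# Raise driver and executor memory at session-build time. These values
# are examples; tune them to your data volume and available RAM.
spark = SparkSession.builder \
    .appName('taxi_batch') \
    .config('spark.driver.memory', '4g') \
    .config('spark.executor.memory', '4g') \
    .getOrCreate()
```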

Compatibility Notes

  • Java version: Spark 3.x officially supports JDK 8 and JDK 11. JDK 17+ may work but is not officially supported for all Spark versions.
  • Platform setup guides: The repository includes platform-specific Spark setup instructions for Linux (`06-batch/setup/linux.md`), macOS (`06-batch/setup/macos.md`), and Windows (`06-batch/setup/windows.md`).
  • Cloud deployment: The BigQuery variant (`06_spark_sql_big_query.py`) requires additional GCP connector jars and service account credentials.
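Checking which JDK major version is installed takes a little parsing, because pre-JDK-9 banners report versions as `1.8.0_...` while newer ones report `11.0.2`-style strings. The helper below is an illustrative assumption, not repository code:

```python
import re

def java_major_version(version_line):
    """Parse the major version out of a `java -version` banner line.

    Pre-JDK-9 banners look like '1.8.0_292'; newer ones like '11.0.2'.
    Returns None if no version string is found.
    """
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_line)
    if not m:
        return None
    major = int(m.group(1))
    # Old-style '1.x' banners encode the real major version second.
    return int(m.group(2)) if major == 1 and m.group(2) else major

# Spark 3.x officially supports JDK 8 and 11.
SUPPORTED = {8, 11}
```

Feeding it the stderr of `java -version` lets a setup script fail fast with a clear message instead of a runtime incompatibility.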
