Implementation: DataTalksClub Data Engineering Zoomcamp SparkSession Builder
| Page Metadata | |
|---|---|
| Knowledge Sources | repo: DataTalksClub/data-engineering-zoomcamp, Spark docs: PySpark API Reference |
| Domains | Data_Engineering, Batch_Processing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for initializing an Apache Spark distributed computing session using the PySpark builder pattern, combined with argument parsing for parameterized batch job execution.
Description
The SparkSession.builder API provides a fluent interface for constructing a SparkSession, the unified entry point for all Spark functionality in PySpark. In this implementation, the builder is configured with an application name ('test') and uses getOrCreate() to either retrieve an existing session or instantiate a new one.
Before the session is created, the script uses Python's argparse module to parse three required command-line arguments: --input_green, --input_yellow, and --output. These arguments parameterize the batch pipeline, allowing the same script to be run against different datasets and output locations without code modification.
This implementation wraps PySpark's session builder together with standard Python argument parsing, yielding a reusable, parameterized batch-processing entry point.
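The fluent-builder and getOrCreate() semantics described above can be illustrated without a Spark cluster. The sketch below is a minimal, hypothetical stand-in (the names MiniSession and MiniBuilder are invented for illustration and are not part of PySpark); it shows why chained calls work and why a second getOrCreate() returns the existing instance:

```python
# Minimal sketch of the fluent-builder + getOrCreate pattern.
# MiniSession / MiniBuilder are hypothetical stand-ins for SparkSession.
class MiniSession:
    _active = None  # class-level slot holding the singleton session

    def __init__(self, app_name):
        self.app_name = app_name


class MiniBuilder:
    def __init__(self):
        self._app_name = None

    def appName(self, name):
        self._app_name = name
        return self  # returning self is what enables fluent chaining

    def getOrCreate(self):
        # Reuse the active session if one exists; otherwise build a new one
        if MiniSession._active is None:
            MiniSession._active = MiniSession(self._app_name)
        return MiniSession._active


s1 = MiniBuilder().appName('test').getOrCreate()
s2 = MiniBuilder().appName('other').getOrCreate()
assert s1 is s2              # second call reuses the existing session
assert s1.app_name == 'test' # so the second appName had no effect
```

This mirrors the behavior documented for SparkSession: configuration passed to the builder only takes effect if no session is already active.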
Usage
Use this pattern to bootstrap any PySpark batch job that requires:
- A named Spark session for cluster resource management and monitoring
- External parameterization of input and output paths via command-line arguments
- Singleton session semantics to prevent duplicate session creation
Code Reference
Source Location: 06-batch/code/06_spark_sql.py, lines 24-26 (session), lines 11-21 (argument parsing)
Signature:
SparkSession.builder.appName('test').getOrCreate() -> SparkSession
Import:
import argparse
from pyspark.sql import SparkSession
CLI Arguments:
- --input_green (required) -- Path to green taxi parquet data
- --input_yellow (required) -- Path to yellow taxi parquet data
- --output (required) -- Path for output parquet data
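The argument-parsing half of the pattern can be exercised directly by passing an argv list to parse_args(); the paths below are illustrative placeholders, not values from the course data:

```python
import argparse

# Build the same parser the script uses; all three paths are mandatory
parser = argparse.ArgumentParser()
parser.add_argument('--input_green', required=True)
parser.add_argument('--input_yellow', required=True)
parser.add_argument('--output', required=True)

# Simulate a command line instead of reading sys.argv
args = parser.parse_args([
    '--input_green', 'data/pq/green/2020/',
    '--input_yellow', 'data/pq/yellow/2020/',
    '--output', 'data/report/revenue/',
])

assert args.input_green == 'data/pq/green/2020/'
assert args.output == 'data/report/revenue/'
```

Omitting any of the three flags raises a SystemExit with a usage message, which is what makes the script fail fast when invoked incorrectly.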
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| appName | str | Yes | Application name for Spark UI and logging |
| --input_green | str (CLI arg) | Yes | File path to green taxi parquet dataset |
| --input_yellow | str (CLI arg) | Yes | File path to yellow taxi parquet dataset |
| --output | str (CLI arg) | Yes | File path for writing output parquet data |
Outputs:
| Output | Type | Description |
|---|---|---|
| spark | SparkSession | Active Spark session instance used for all subsequent read, transform, and write operations |
| input_green | str | Parsed path string for green taxi data |
| input_yellow | str | Parsed path string for yellow taxi data |
| output | str | Parsed path string for output destination |
Usage Examples
Running the batch job from the command line:
python 06_spark_sql.py \
--input_green data/pq/green/2020/*/ \
--input_yellow data/pq/yellow/2020/*/ \
--output data/report/revenue/
Session initialization within the script:
import argparse
from pyspark.sql import SparkSession

# Parse the required input/output paths from the command line
parser = argparse.ArgumentParser()
parser.add_argument('--input_green', required=True)
parser.add_argument('--input_yellow', required=True)
parser.add_argument('--output', required=True)
args = parser.parse_args()

input_green = args.input_green
input_yellow = args.input_yellow
output = args.output

# Retrieve an existing SparkSession or create a new one
spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()
Submitting via spark-submit:
spark-submit \
--master spark://localhost:7077 \
06_spark_sql.py \
--input_green data/pq/green/2021/*/ \
--input_yellow data/pq/yellow/2021/*/ \
--output data/report/revenue/2021/