
Implementation:DataTalksClub Data engineering zoomcamp SparkSession Builder

From Leeroopedia


Page Metadata
Knowledge Sources repo: DataTalksClub/data-engineering-zoomcamp, Spark docs: PySpark API Reference
Domains Data_Engineering, Batch_Processing
Last Updated 2026-02-09 14:00 GMT

Overview

Concrete tool for initializing an Apache Spark session using the PySpark SparkSession.builder pattern, combined with command-line argument parsing for parameterized batch job execution.

Description

The SparkSession.builder API provides a fluent interface for constructing a SparkSession, the unified entry point for all Spark functionality in PySpark. In this implementation, the builder is configured with an application name ('test') and uses getOrCreate() to either retrieve an existing session or instantiate a new one.

Before the session is created, the script uses Python's argparse module to parse three required command-line arguments: --input_green, --input_yellow, and --output. These arguments parameterize the batch pipeline, allowing the same script to be run against different datasets and output locations without code modification.

This pattern is a Wrapper Doc implementation, combining PySpark's session builder with standard Python argument parsing to create a reusable, parameterized batch processing entry point.

Usage

Use this pattern to bootstrap any PySpark batch job that requires:

  • A named Spark session for cluster resource management and monitoring
  • External parameterization of input and output paths via command-line arguments
  • Singleton session semantics to prevent duplicate session creation
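
The singleton semantics are easy to demonstrate. A minimal sketch (assumes a working PySpark installation; this snippet is illustrative and not part of the course script):

from pyspark.sql import SparkSession

# First call instantiates a new session named 'test'.
spark = SparkSession.builder.appName('test').getOrCreate()

# A second getOrCreate() call returns the already-active session
# rather than creating a duplicate.
spark_again = SparkSession.builder.appName('test').getOrCreate()
assert spark is spark_again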

Code Reference

Source Location: 06-batch/code/06_spark_sql.py, lines 24-26 (session), lines 11-21 (argument parsing)

Signature:

SparkSession.builder.appName('test').getOrCreate() -> SparkSession

Import:

import argparse
from pyspark.sql import SparkSession

CLI Arguments:

  • --input_green (required) -- Path to green taxi parquet data
  • --input_yellow (required) -- Path to yellow taxi parquet data
  • --output (required) -- Path for output parquet data
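
Because all three flags are declared with required=True, argparse aborts before any Spark work starts if one is missing. An illustrative shell session (the exact error wording may vary slightly across Python versions):

$ python 06_spark_sql.py \
    --input_green 'data/pq/green/2020/*/' \
    --output data/report/revenue/
usage: 06_spark_sql.py [-h] --input_green INPUT_GREEN --input_yellow INPUT_YELLOW --output OUTPUT
06_spark_sql.py: error: the following arguments are required: --input_yellow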

I/O Contract

Inputs:

  • appName (str, required) -- Application name for the Spark UI and logging
  • --input_green (str, CLI arg, required) -- File path to green taxi parquet dataset
  • --input_yellow (str, CLI arg, required) -- File path to yellow taxi parquet dataset
  • --output (str, CLI arg, required) -- File path for writing output parquet data

Outputs:

  • spark (SparkSession) -- Active Spark session instance used for all subsequent read, transform, and write operations
  • input_green (str) -- Parsed path string for green taxi data
  • input_yellow (str) -- Parsed path string for yellow taxi data
  • output (str) -- Parsed path string for the output destination

Usage Examples

Running the batch job from the command line (the glob patterns are quoted so the shell passes them to Spark unexpanded; Spark resolves them itself):

python 06_spark_sql.py \
    --input_green 'data/pq/green/2020/*/' \
    --input_yellow 'data/pq/yellow/2020/*/' \
    --output data/report/revenue/

Session initialization within the script:

import argparse
from pyspark.sql import SparkSession

# Parse the three required pipeline parameters from the command line.
parser = argparse.ArgumentParser()
parser.add_argument('--input_green', required=True)
parser.add_argument('--input_yellow', required=True)
parser.add_argument('--output', required=True)
args = parser.parse_args()

input_green = args.input_green
input_yellow = args.input_yellow
output = args.output

# Retrieve an existing session, or create a new one named 'test'.
spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()
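
A minimal sketch of how the parsed paths and session are typically used downstream (illustrative only -- the actual transformations in 06_spark_sql.py are not shown on this page; spark.read.parquet and DataFrame.write.parquet are the standard PySpark APIs assumed here):

# Spark resolves the glob patterns in the parsed paths itself.
df_green = spark.read.parquet(input_green)
df_yellow = spark.read.parquet(input_yellow)

# Placeholder transformation: union the columns the two datasets share.
common_cols = [c for c in df_green.columns if c in df_yellow.columns]
df_all = df_green.select(common_cols).unionByName(df_yellow.select(common_cols))

# Write the result to the parameterized output location.
df_all.write.parquet(output, mode='overwrite')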

Submitting via spark-submit:

spark-submit \
    --master spark://localhost:7077 \
    06_spark_sql.py \
    --input_green 'data/pq/green/2021/*/' \
    --input_yellow 'data/pq/yellow/2021/*/' \
    --output data/report/revenue/2021/
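
For local testing without a standalone cluster, the same job can be submitted with a local master instead (assumes a local Spark installation; local[*] uses all available cores):

spark-submit \
    --master 'local[*]' \
    06_spark_sql.py \
    --input_green 'data/pq/green/2021/*/' \
    --input_yellow 'data/pq/yellow/2021/*/' \
    --output data/report/revenue/2021/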
