Workflow: DataTalksClub Data Engineering Zoomcamp Spark Batch Processing
| Knowledge Sources | Details |
|---|---|
| Domains | Data_Engineering, Batch_Processing, Big_Data |
| Last Updated | 2026-02-09 07:00 GMT |
Overview
End-to-end batch processing pipeline using PySpark to read, unify, and aggregate NYC taxi trip data from Parquet files, producing monthly revenue reports by zone and service type.
Description
This workflow demonstrates batch data processing using Apache Spark with PySpark. It reads green and yellow taxi trip Parquet files, normalizes column names across the two schemas, and unions the datasets with a service type discriminator. The combined data is registered as a Spark SQL temporary table, and an aggregation query computes monthly revenue metrics grouped by pickup zone, month, and service type. The output is written as a coalesced Parquet file; a BigQuery variant writes results directly to Google BigQuery via the Spark BigQuery connector.
Usage
Execute this workflow when you have large volumes of taxi trip data in Parquet format and need to compute aggregate analytics that exceed the capacity of single-machine tools. Use it when your data is stored locally or on a cloud storage system (GCS, S3) and you have a Spark cluster available (local mode, standalone, or cloud-managed like Dataproc).
Execution Steps
Step 1: Spark Session Initialization
Create a SparkSession with an application name and any required configurations. For local development, Spark runs in local mode. For cloud deployment (BigQuery variant), configure the BigQuery connector JAR, temporary GCS bucket, and project credentials.
Key considerations:
- Local mode uses all available cores when the master is local[*] (plain local runs a single thread)
- BigQuery connector requires the spark-bigquery-connector JAR on the classpath
- Temporary GCS bucket is needed for BigQuery write staging
Step 2: Data Ingestion
Read green and yellow taxi trip data from Parquet files using Spark's native Parquet reader. Input paths are provided as command-line arguments, supporting both local file paths and cloud storage URIs (gs://, s3://).
Key considerations:
- Parquet schema is inferred automatically from file metadata
- Input paths can use wildcards to read multiple partitions
- Spark parallelizes reads across available executors
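A minimal ingestion sketch, assuming the two input paths arrive as command-line arguments (the argument order and app name are assumptions):

```python
import sys

def read_trips(spark, path):
    # Schema is inferred from Parquet file metadata; the path may be a local
    # directory, a wildcard over partitions, or a cloud URI (gs://, s3a://).
    # Spark parallelizes the read across available executors.
    return spark.read.parquet(path)

if __name__ == "__main__":
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("taxi-ingest").getOrCreate()
    green_path, yellow_path = sys.argv[1], sys.argv[2]
    df_green = read_trips(spark, green_path)
    df_yellow = read_trips(spark, yellow_path)
```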
Step 3: Schema Normalization
Rename pickup and dropoff datetime columns to a common naming convention. Green taxi data uses lpep_ prefix while yellow taxi data uses tpep_ prefix. Both are renamed to pickup_datetime and dropoff_datetime for consistency.
Key considerations:
- Column renaming uses withColumnRenamed for clarity
- Only the differing datetime columns need normalization
- All other shared columns have identical names across both datasets
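The renaming can be captured in two small maps and one helper, as sketched below (the helper name is an assumption; the column names are the standard green/yellow datetime fields):

```python
# Vendor-specific datetime columns mapped to the common naming convention.
GREEN_RENAMES = {"lpep_pickup_datetime": "pickup_datetime",
                 "lpep_dropoff_datetime": "dropoff_datetime"}
YELLOW_RENAMES = {"tpep_pickup_datetime": "pickup_datetime",
                  "tpep_dropoff_datetime": "dropoff_datetime"}

def normalize(df, renames):
    """Rename datetime columns on a Spark DataFrame.

    withColumnRenamed is a no-op when the source column is absent,
    so the same helper is safe for either dataset.
    """
    for old, new in renames.items():
        df = df.withColumnRenamed(old, new)
    return df
```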
Step 4: Dataset Unification
Select the common columns from both normalized DataFrames, add a service_type literal column (green or yellow), and union the two datasets into a single DataFrame. Register the unified DataFrame as a temporary SQL table for querying.
Key considerations:
- Only 18 common columns are selected, excluding dataset-specific fields
- The service_type column enables grouping by taxi type in aggregations
- registerTempTable (deprecated since Spark 2.0 in favor of createOrReplaceTempView) makes the data available for Spark SQL queries
Step 5: Revenue Aggregation
Execute a Spark SQL query that computes monthly revenue metrics grouped by pickup zone, revenue month, and service type. Metrics include sums of fare, extras, taxes, tips, tolls, surcharges, and total amount, plus averages of passenger count and trip distance.
Key considerations:
- date_trunc truncates pickup datetime to month granularity
- GROUP BY uses positional references (1, 2, 3) for brevity
- All monetary columns are summed independently for detailed revenue breakdown
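A query in the shape described above might look like the following; the table name trips_data and the output aliases are assumptions for illustration:

```python
# Monthly revenue aggregation over the unified trips table.
REVENUE_QUERY = """
SELECT
    PULocationID AS revenue_zone,
    date_trunc('month', pickup_datetime) AS revenue_month,
    service_type,
    SUM(fare_amount) AS revenue_monthly_fare,
    SUM(extra) AS revenue_monthly_extra,
    SUM(mta_tax) AS revenue_monthly_mta_tax,
    SUM(tip_amount) AS revenue_monthly_tip_amount,
    SUM(tolls_amount) AS revenue_monthly_tolls_amount,
    SUM(improvement_surcharge) AS revenue_monthly_improvement_surcharge,
    SUM(congestion_surcharge) AS revenue_monthly_congestion_surcharge,
    SUM(total_amount) AS revenue_monthly_total_amount,
    AVG(passenger_count) AS avg_monthly_passenger_count,
    AVG(trip_distance) AS avg_monthly_trip_distance
FROM trips_data
GROUP BY 1, 2, 3
"""

# Usage: df_result = spark.sql(REVENUE_QUERY)
```

The positional GROUP BY 1, 2, 3 refers to revenue_zone, revenue_month, and service_type in SELECT order.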
Step 6: Output Writing
Write the aggregated results to a Parquet file using coalesce(1) to produce a single output file. The BigQuery variant instead writes directly to a BigQuery table using the Spark BigQuery connector with overwrite mode.
Key considerations:
- coalesce(1) collapses the result to a single partition, producing one output file for easy downstream consumption
- Overwrite mode replaces existing output on re-runs
- BigQuery writes use a temporary GCS bucket for staging
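Both write paths can be sketched as small helpers; the function names are assumptions, and the BigQuery options shown (table, temporaryGcsBucket) follow the spark-bigquery-connector's documented write interface:

```python
def write_parquet(df, output_path):
    # One partition -> one output file; overwrite replaces prior runs.
    df.coalesce(1).write.parquet(output_path, mode="overwrite")

def write_bigquery(df, table, staging_bucket):
    # BigQuery variant; assumes the spark-bigquery-connector JAR is on the
    # classpath. Writes stage through the temporary GCS bucket.
    (df.write.format("bigquery")
       .option("table", table)
       .option("temporaryGcsBucket", staging_bucket)
       .mode("overwrite")
       .save())
```

Note that coalesce(1) funnels the final write through a single task, which is fine for a small aggregate but a bottleneck for large outputs.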