Workflow: DataTalksClub Data Engineering Zoomcamp Spark Batch Processing
| Knowledge Sources | Details |
|---|---|
| Domains | Data_Engineering, Batch_Processing, Big_Data |
| Last Updated | 2026-02-09 07:00 GMT |
Overview
End-to-end batch processing pipeline using PySpark to read, unify, and aggregate NYC taxi trip data from Parquet files, producing monthly revenue reports by zone and service type.
Description
This workflow demonstrates batch data processing using Apache Spark with PySpark. It reads green and yellow taxi trip Parquet files, normalizes column names across the two schemas, and unions the datasets with a service type discriminator. The combined data is registered as a Spark SQL temporary table, and an aggregation query computes monthly revenue metrics grouped by pickup zone, month, and service type. The output is written as a coalesced Parquet file; a BigQuery variant writes results directly to Google BigQuery via the Spark BigQuery connector.
Usage
Execute this workflow when you have large volumes of taxi trip data in Parquet format and need to compute aggregate analytics that exceed the capacity of single-machine tools. Use it when your data is stored locally or on a cloud storage system (GCS, S3) and you have a Spark cluster available (local mode, standalone, or cloud-managed like Dataproc).
Execution Steps
Step 1: Spark Session Initialization
Create a SparkSession with an application name and any required configurations. For local development, Spark runs in local mode. For cloud deployment (BigQuery variant), configure the BigQuery connector JAR, temporary GCS bucket, and project credentials.
Key considerations:
- Local mode uses all available cores when the master is local[*] (plain local runs a single thread)
- BigQuery connector requires the spark-bigquery-connector JAR on the classpath
- Temporary GCS bucket is needed for BigQuery write staging
Step 2: Data Ingestion
Read green and yellow taxi trip data from Parquet files using Spark's native Parquet reader. Input paths are provided as command-line arguments, supporting both local file paths and cloud storage URIs (gs://, s3://).
Key considerations:
- Parquet schema is inferred automatically from file metadata
- Input paths can use wildcards to read multiple partitions
- Spark parallelizes reads across available executors
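A minimal ingestion sketch, assuming the two input paths arrive as command-line arguments (the argument order and app name are assumptions):

```python
import sys

def read_trips(spark, path):
    # Schema is inferred from Parquet file metadata; the path may be a local
    # directory, a wildcard over partitions, or a cloud URI (gs://, s3a://).
    # Spark parallelizes the read across available executors.
    return spark.read.parquet(path)

if __name__ == "__main__":
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("taxi-ingest").getOrCreate()
    green_path, yellow_path = sys.argv[1], sys.argv[2]
    df_green = read_trips(spark, green_path)
    df_yellow = read_trips(spark, yellow_path)
```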
Step 3: Schema Normalization
Rename pickup and dropoff datetime columns to a common naming convention. Green taxi data uses lpep_ prefix while yellow taxi data uses tpep_ prefix. Both are renamed to pickup_datetime and dropoff_datetime for consistency.
Key considerations:
- Column renaming uses withColumnRenamed for clarity
- Only the differing datetime columns need normalization
- All other shared columns have identical names across both datasets
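The renaming can be captured in two small maps and one helper, as sketched below (the helper name is an assumption; the column names are the standard green/yellow datetime fields):

```python
# Vendor-specific datetime columns mapped to the common naming convention.
GREEN_RENAMES = {"lpep_pickup_datetime": "pickup_datetime",
                 "lpep_dropoff_datetime": "dropoff_datetime"}
YELLOW_RENAMES = {"tpep_pickup_datetime": "pickup_datetime",
                  "tpep_dropoff_datetime": "dropoff_datetime"}

def normalize(df, renames):
    """Rename datetime columns on a Spark DataFrame.

    withColumnRenamed is a no-op when the source column is absent,
    so the same helper is safe for either dataset.
    """
    for old, new in renames.items():
        df = df.withColumnRenamed(old, new)
    return df
```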
Step 4: Dataset Unification
Select the common columns from both normalized DataFrames, add a service_type literal column (green or yellow), and union the two datasets into a single DataFrame. Register the unified DataFrame as a temporary SQL table for querying.
Key considerations:
- Only 18 common columns are selected, excluding dataset-specific fields
- The service_type column enables grouping by taxi type in aggregations
- registerTempTable (deprecated since Spark 2.0 in favor of createOrReplaceTempView) makes the data available for Spark SQL queries
Step 5: Revenue Aggregation
Execute a Spark SQL query that computes monthly revenue metrics grouped by pickup zone, revenue month, and service type. Metrics include sums of fare, extras, taxes, tips, tolls, surcharges, and total amount, plus averages of passenger count and trip distance.
Key considerations:
- date_trunc truncates pickup datetime to month granularity
- GROUP BY uses positional references (1, 2, 3) for brevity
- All monetary columns are summed independently for detailed revenue breakdown
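A query in the shape described above might look like the following; the table name trips_data and the output aliases are assumptions for illustration:

```python
# Monthly revenue aggregation over the unified trips table.
REVENUE_QUERY = """
SELECT
    PULocationID AS revenue_zone,
    date_trunc('month', pickup_datetime) AS revenue_month,
    service_type,
    SUM(fare_amount) AS revenue_monthly_fare,
    SUM(extra) AS revenue_monthly_extra,
    SUM(mta_tax) AS revenue_monthly_mta_tax,
    SUM(tip_amount) AS revenue_monthly_tip_amount,
    SUM(tolls_amount) AS revenue_monthly_tolls_amount,
    SUM(improvement_surcharge) AS revenue_monthly_improvement_surcharge,
    SUM(congestion_surcharge) AS revenue_monthly_congestion_surcharge,
    SUM(total_amount) AS revenue_monthly_total_amount,
    AVG(passenger_count) AS avg_monthly_passenger_count,
    AVG(trip_distance) AS avg_monthly_trip_distance
FROM trips_data
GROUP BY 1, 2, 3
"""

# Usage: df_result = spark.sql(REVENUE_QUERY)
```

The positional GROUP BY 1, 2, 3 refers to revenue_zone, revenue_month, and service_type in SELECT order.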
Step 6: Output Writing
Write the aggregated results to a Parquet file using coalesce(1) to produce a single output file. The BigQuery variant instead writes directly to a BigQuery table using the Spark BigQuery connector with overwrite mode.
Key considerations:
- coalesce(1) collapses the result to a single partition, producing one output file for easy downstream consumption
- Overwrite mode replaces existing output on re-runs
- BigQuery writes use a temporary GCS bucket for staging
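Both write paths can be sketched as small helpers; the function names are assumptions, and the BigQuery options shown (table, temporaryGcsBucket) follow the spark-bigquery-connector's documented write interface:

```python
def write_parquet(df, output_path):
    # One partition -> one output file; overwrite replaces prior runs.
    df.coalesce(1).write.parquet(output_path, mode="overwrite")

def write_bigquery(df, table, staging_bucket):
    # BigQuery variant; assumes the spark-bigquery-connector JAR is on the
    # classpath. Writes stage through the temporary GCS bucket.
    (df.write.format("bigquery")
       .option("table", table)
       .option("temporaryGcsBucket", staging_bucket)
       .mode("overwrite")
       .save())
```

Note that coalesce(1) funnels the final write through a single task, which is fine for a small aggregate but a bottleneck for large outputs.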