Implementation:DataTalksClub Data engineering zoomcamp Spark WithColumnRenamed
| Page Metadata | |
|---|---|
| Knowledge Sources | repo: DataTalksClub/data-engineering-zoomcamp, Spark docs: PySpark API Reference |
| Domains | Data_Engineering, Batch_Processing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for renaming DataFrame columns in PySpark to harmonize schemas across green and yellow taxi datasets before combining them.
Description
The df.withColumnRenamed(existing, new) method returns a new DataFrame with the specified column renamed. In this implementation, it is used to normalize the pickup and dropoff datetime column names across two taxi datasets that use different prefixes:
- Green taxi columns use the lpep_ prefix: lpep_pickup_datetime, lpep_dropoff_datetime
- Yellow taxi columns use the tpep_ prefix: tpep_pickup_datetime, tpep_dropoff_datetime
Both are renamed to the canonical forms pickup_datetime and dropoff_datetime, enabling the two datasets to share a common schema for subsequent union operations.
Multiple withColumnRenamed() calls are chained together in a fluent style. Each call returns a new DataFrame (Spark DataFrames are immutable), so the chain produces a final DataFrame with all renames applied.
This is a Wrapper Doc implementation wrapping PySpark's DataFrame.withColumnRenamed() method.
Usage
Use this implementation when:
- Aligning column names across datasets from different taxi service types
- Preparing DataFrames for a union operation that requires matching column names
- Renaming columns that have vendor-specific or source-specific prefixes
Code Reference
Source Location: 06-batch/code/06_spark_sql.py, lines 30-39
Signature:
df.withColumnRenamed(existing, new) -> DataFrame
Import:
No separate import is required for withColumnRenamed itself; it is an instance method on DataFrame. The SparkSession import below is only needed to create the session that reads the data:
from pyspark.sql import SparkSession
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| existing | str | Yes | The current column name to be renamed |
| new | str | Yes | The new column name to assign |
Green Taxi Renames:
| Original Column | Renamed Column |
|---|---|
| lpep_pickup_datetime | pickup_datetime |
| lpep_dropoff_datetime | dropoff_datetime |
Yellow Taxi Renames:
| Original Column | Renamed Column |
|---|---|
| tpep_pickup_datetime | pickup_datetime |
| tpep_dropoff_datetime | dropoff_datetime |
Outputs:
| Output | Type | Description |
|---|---|---|
| df_green | DataFrame | Green taxi DataFrame with normalized datetime column names |
| df_yellow | DataFrame | Yellow taxi DataFrame with normalized datetime column names |
Usage Examples
Renaming green taxi columns:
df_green = spark.read.parquet(input_green)
df_green = df_green \
.withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime') \
.withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')
Renaming yellow taxi columns:
df_yellow = spark.read.parquet(input_yellow)
df_yellow = df_yellow \
.withColumnRenamed('tpep_pickup_datetime', 'pickup_datetime') \
.withColumnRenamed('tpep_dropoff_datetime', 'dropoff_datetime')
Verifying the rename:
# Before rename
df_green.columns
# ['VendorID', 'lpep_pickup_datetime', 'lpep_dropoff_datetime', ...]
# After rename
df_green = df_green \
.withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime') \
.withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')
df_green.columns
# ['VendorID', 'pickup_datetime', 'dropoff_datetime', ...]