Implementation: DataTalksClub Data Engineering Zoomcamp Spark Read Parquet
| Page Metadata | |
|---|---|
| Knowledge Sources | repo: DataTalksClub/data-engineering-zoomcamp, Spark docs: PySpark API Reference |
| Domains | Data_Engineering, Batch_Processing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for reading Apache Parquet files into PySpark DataFrames, leveraging Spark's built-in Parquet reader for schema-aware, columnar data ingestion.
Description
The spark.read.parquet(path) method reads one or more Parquet files from the specified path into a distributed DataFrame. In this implementation, two separate Parquet datasets are ingested: green taxi trip data and yellow taxi trip data. Each is loaded into its own DataFrame for independent downstream processing.
The path argument supports glob patterns (e.g., data/pq/green/2020/*/), allowing multiple partitioned directories to be read in a single call. Spark automatically discovers the schema from the Parquet file metadata and distributes the data across cluster executors.
This is a Wrapper Doc implementation that wraps PySpark's DataFrameReader.parquet() method for batch data ingestion.
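To make the wrapping explicit: spark.read returns a DataFrameReader, and calling parquet() on that reader performs the actual load. A minimal sketch, assuming an existing SparkSession bound to the name spark:
reader = spark.read  # DataFrameReader instance
df_green = reader.parquet('data/pq/green/2020/*/')  # equivalent to spark.read.parquet(...)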
Usage
Use this implementation when:
- Loading taxi trip data (or any Parquet-formatted data) into Spark for batch processing
- Reading from partitioned directory structures using glob patterns
- Schema inference from Parquet metadata is desired (no manual schema definition needed); an explicit-schema alternative is sketched below
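For contrast, the underlying reader also accepts an explicit schema via DataFrameReader.schema(), which bypasses inference entirely. A minimal sketch, with a deliberately truncated schema (the real taxi datasets carry many more columns than shown here):
from pyspark.sql import types

# Truncated, illustrative schema; Spark will use it instead of the Parquet metadata
schema = types.StructType([
    types.StructField('VendorID', types.IntegerType(), True),
    types.StructField('trip_distance', types.DoubleType(), True),
])
df_green = spark.read.schema(schema).parquet('data/pq/green/2020/*/')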
Code Reference
Source Location: 06-batch/code/06_spark_sql.py, lines 28-34
Signature:
spark.read.parquet(path) -> DataFrame
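Note that the underlying PySpark method is variadic, DataFrameReader.parquet(*paths, **options), so several paths can be passed in one call; this page documents the single-path usage from the source file.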
Import:
from pyspark.sql import SparkSession
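All examples below assume a live SparkSession bound to the name spark. A minimal construction sketch (the local[*] master and app name are assumptions for local experimentation, not taken from the source file):
from pyspark.sql import SparkSession

# Local session for experimentation; cluster deployments configure this differently
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('test') \
    .getOrCreate()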
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | File system path or glob pattern pointing to Parquet files (e.g., data/pq/green/2020/*/) |
Outputs:
| Output | Type | Description |
|---|---|---|
| df_green | DataFrame | Distributed DataFrame containing all rows and columns from the green taxi Parquet files |
| df_yellow | DataFrame | Distributed DataFrame containing all rows and columns from the yellow taxi Parquet files |
Usage Examples
Reading green and yellow taxi data:
# input_green and input_yellow are path strings (or glob patterns) defined earlier in the script
df_green = spark.read.parquet(input_green)
df_yellow = spark.read.parquet(input_yellow)
Reading with explicit glob patterns:
# Read all months of 2020 green taxi data
df_green = spark.read.parquet('data/pq/green/2020/*/')
# Read all months of 2020 yellow taxi data
df_yellow = spark.read.parquet('data/pq/yellow/2020/*/')
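When a glob matches many directories, Spark expects the files to share a compatible schema. If partitions were written with a schema that evolved over time, the Parquet reader's mergeSchema option can reconcile them. A hedged sketch (the option is part of PySpark's Parquet data source, but it is not used in the zoomcamp code):
# Reconcile differing-but-compatible schemas across matched files (adds read overhead)
df_green = spark.read.option('mergeSchema', 'true').parquet('data/pq/green/2020/*/')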
Reading a single partition:
# Read only January 2021 green taxi data
df_green_jan = spark.read.parquet('data/pq/green/2021/01/')
Inspecting the ingested data:
df_green = spark.read.parquet(input_green)
df_green.printSchema()                    # schema comes from Parquet metadata; no data scan
df_green.show(5)                          # preview the first five rows
print(f"Row count: {df_green.count()}")   # count() triggers a full Spark job
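Note that printSchema() is cheap because the schema is read from Parquet file metadata, while count() materializes a full job over every matched file; for a quick sanity check on large datasets, show() is usually sufficient.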