Implementation: DataTalksClub Data Engineering Zoomcamp Spark Read Parquet
| Page Metadata | |
|---|---|
| Knowledge Sources | repo: DataTalksClub/data-engineering-zoomcamp, Spark docs: PySpark API Reference |
| Domains | Data_Engineering, Batch_Processing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for reading Apache Parquet files into PySpark DataFrames, leveraging Spark's built-in Parquet reader for schema-aware, columnar data ingestion.
Description
The spark.read.parquet(path) method reads one or more Parquet files from the specified path into a distributed DataFrame. In this implementation, two separate Parquet datasets are ingested: green taxi trip data and yellow taxi trip data. Each is loaded into its own DataFrame for independent downstream processing.
The path argument supports glob patterns (e.g., data/pq/green/2020/*/), allowing multiple partitioned directories to be read in a single call. Spark automatically discovers the schema from the Parquet file metadata and distributes the data across cluster executors.
This is a Wrapper Doc implementation that wraps PySpark's DataFrameReader.parquet() method for batch data ingestion.
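To make the wrapping explicit: spark.read returns a DataFrameReader, and calling parquet() on that reader performs the actual load. A minimal sketch, assuming an existing SparkSession bound to the name spark:
reader = spark.read  # DataFrameReader instance
df_green = reader.parquet('data/pq/green/2020/*/')  # equivalent to spark.read.parquet(...)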
Usage
Use this implementation when:
- Loading taxi trip data (or any Parquet-formatted data) into Spark for batch processing
- Reading from partitioned directory structures using glob patterns
- Schema inference from Parquet metadata is desired (no manual schema definition needed); an explicit-schema alternative is sketched below
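For contrast, the underlying reader also accepts an explicit schema via DataFrameReader.schema(), which bypasses inference entirely. A minimal sketch, with a deliberately truncated schema (the real taxi datasets carry many more columns than shown here):
from pyspark.sql import types

# Truncated, illustrative schema; Spark will use it instead of the Parquet metadata
schema = types.StructType([
    types.StructField('VendorID', types.IntegerType(), True),
    types.StructField('trip_distance', types.DoubleType(), True),
])
df_green = spark.read.schema(schema).parquet('data/pq/green/2020/*/')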
Code Reference
Source Location: 06-batch/code/06_spark_sql.py, lines 28-34
Signature:
spark.read.parquet(path) -> DataFrame
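Note that the underlying PySpark method is variadic, DataFrameReader.parquet(*paths, **options), so several paths can be passed in one call; this page documents the single-path usage from the source file.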
Import:
from pyspark.sql import SparkSession
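All examples below assume a live SparkSession bound to the name spark. A minimal construction sketch (the local[*] master and app name are assumptions for local experimentation, not taken from the source file):
from pyspark.sql import SparkSession

# Local session for experimentation; cluster deployments configure this differently
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('test') \
    .getOrCreate()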
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | File system path or glob pattern pointing to Parquet files (e.g., data/pq/green/2020/*/) |
Outputs:
| Output | Type | Description |
|---|---|---|
| df_green | DataFrame | Distributed DataFrame containing all rows and columns from the green taxi Parquet files |
| df_yellow | DataFrame | Distributed DataFrame containing all rows and columns from the yellow taxi Parquet files |
Usage Examples
Reading green and yellow taxi data:
# input_green and input_yellow are path strings (or glob patterns) defined earlier in the script
df_green = spark.read.parquet(input_green)
df_yellow = spark.read.parquet(input_yellow)
Reading with explicit glob patterns:
# Read all months of 2020 green taxi data
df_green = spark.read.parquet('data/pq/green/2020/*/')
# Read all months of 2020 yellow taxi data
df_yellow = spark.read.parquet('data/pq/yellow/2020/*/')
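When a glob matches many directories, Spark expects the files to share a compatible schema. If partitions were written with a schema that evolved over time, the Parquet reader's mergeSchema option can reconcile them. A hedged sketch (the option is part of PySpark's Parquet data source, but it is not used in the zoomcamp code):
# Reconcile differing-but-compatible schemas across matched files (adds read overhead)
df_green = spark.read.option('mergeSchema', 'true').parquet('data/pq/green/2020/*/')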
Reading a single partition:
# Read only January 2021 green taxi data
df_green_jan = spark.read.parquet('data/pq/green/2021/01/')
Inspecting the ingested data:
df_green = spark.read.parquet(input_green)
df_green.printSchema()                    # schema comes from Parquet metadata; no data scan
df_green.show(5)                          # preview the first five rows
print(f"Row count: {df_green.count()}")   # count() triggers a full Spark job
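Note that printSchema() is cheap because the schema is read from Parquet file metadata, while count() materializes a full job over every matched file; for a quick sanity check on large datasets, show() is usually sufficient.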