Implementation:DataTalksClub Data engineering zoomcamp Spark Read Parquet

From Leeroopedia


Page Metadata
Knowledge Sources repo: DataTalksClub/data-engineering-zoomcamp, Spark docs: PySpark API Reference
Domains Data_Engineering, Batch_Processing
Last Updated 2026-02-09 14:00 GMT

Overview

Concrete tool for reading Apache Parquet files into PySpark DataFrames, leveraging Spark's built-in Parquet reader for schema-aware, columnar data ingestion.

Description

The spark.read.parquet(path) method reads one or more Parquet files from the specified path into a distributed DataFrame. In this implementation, two separate Parquet datasets are ingested: green taxi trip data and yellow taxi trip data. Each is loaded into its own DataFrame for independent downstream processing.

The path argument supports glob patterns (e.g., data/pq/green/2020/*/), allowing multiple partitioned directories to be read in a single call. Spark automatically discovers the schema from the Parquet file metadata and distributes the data across cluster executors.

This is a Wrapper Doc implementation: a thin wrapper over PySpark's DataFrameReader.parquet() method for batch data ingestion.

Usage

Use this implementation when:

  • Loading taxi trip data (or any Parquet-formatted data) into Spark for batch processing
  • Reading from partitioned directory structures using glob patterns
  • Schema inference from Parquet metadata is desired (no manual schema definition needed)

Code Reference

Source Location: 06-batch/code/06_spark_sql.py, lines 28-34

Signature:

spark.read.parquet(path) -> DataFrame

Import:

from pyspark.sql import SparkSession

I/O Contract

Inputs:

  • path (str, required): File system path or glob pattern pointing to Parquet files (e.g., data/pq/green/2020/*/)

Outputs:

  • df_green (DataFrame): Distributed DataFrame containing all rows and columns from the green taxi Parquet files
  • df_yellow (DataFrame): Distributed DataFrame containing all rows and columns from the yellow taxi Parquet files

Usage Examples

Reading green and yellow taxi data:

df_green = spark.read.parquet(input_green)

df_yellow = spark.read.parquet(input_yellow)

Reading with explicit glob patterns:

# Read all months of 2020 green taxi data
df_green = spark.read.parquet('data/pq/green/2020/*/')

# Read all months of 2020 yellow taxi data
df_yellow = spark.read.parquet('data/pq/yellow/2020/*/')

Reading a single partition:

# Read only January 2021 green taxi data
df_green_jan = spark.read.parquet('data/pq/green/2021/01/')

Inspecting the ingested data:

df_green = spark.read.parquet(input_green)
df_green.printSchema()
df_green.show(5)
print(f"Row count: {df_green.count()}")
