Implementation:DataTalksClub Data engineering zoomcamp Spark Write Parquet



Page Metadata
Knowledge Sources repo: DataTalksClub/data-engineering-zoomcamp, Spark docs: PySpark API Reference
Domains Data_Engineering, Batch_Processing
Last Updated 2026-02-09 14:00 GMT

Overview

Concrete tool for writing a PySpark DataFrame to Parquet format with single-partition coalescing and overwrite semantics, producing one output file for downstream consumption.

Description

This implementation chains two PySpark operations to write the aggregated revenue results to disk:

  1. df.coalesce(1) -- Reduces the DataFrame to a single partition. Since the aggregated result is relatively small (one row per unique combination of zone, month, and service type), coalescing to one partition is efficient and produces a single output Parquet file rather than many small files.
  2. .write.parquet(output, mode='overwrite') -- Writes the coalesced DataFrame to the specified output path in Parquet format. The overwrite mode ensures that any previously existing data at the output path is replaced, making the pipeline idempotent.
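
A minimal sketch (not quoted from the course script) of the effect of step 1 on partition count; df_result and output stand in for the aggregated DataFrame and output path described in the I/O Contract below:

# Partition count before and after coalescing (values are illustrative)
print(df_result.rdd.getNumPartitions())   # e.g. 200 after an aggregation shuffle
df_single = df_result.coalesce(1)
print(df_single.rdd.getNumPartitions())   # 1

# Step 2: write the single partition as one Parquet file, replacing any prior output
df_single.write.parquet(output, mode='overwrite')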

The output directory will contain:

  • One Parquet data file (e.g., part-00000-*.snappy.parquet)
  • A _SUCCESS marker file indicating successful completion

Note that with default settings (Spark 2.0 and later), the _metadata and _common_metadata summary files are not written; schema information is carried in the footer of the Parquet data file itself.
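
A quick way to confirm this layout after a run, assuming output is a local filesystem path (for HDFS or object storage, use the corresponding filesystem client instead):

import os

# List the files Spark produced in the output directory
print(sorted(os.listdir(output)))
# Expected (exact part-file name varies): ['_SUCCESS', 'part-00000-<id>-c000.snappy.parquet']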

This is a Wrapper Doc implementation wrapping PySpark's DataFrame.coalesce() and DataFrameWriter.parquet() methods.
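
For reference, the same call unchained, which makes the two wrapped methods (and the intermediate DataFrameWriter) explicit; behaviour is identical to the one-liner:

df_single = df_result.coalesce(1)            # DataFrame.coalesce -> DataFrame with one partition
writer = df_single.write.mode('overwrite')   # DataFrame.write -> DataFrameWriter, with save mode set
writer.parquet(output)                       # DataFrameWriter.parquet -> writes the Parquet directory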

Usage

Use this implementation when:

  • Writing aggregated analytical results that are small enough for a single output file
  • The output must be idempotent (safe to overwrite on pipeline reruns)
  • Downstream consumers expect a single Parquet file rather than a partitioned directory
  • Persisting the final stage of a batch ETL pipeline

Code Reference

Source Location: 06-batch/code/06_spark_sql.py, lines 106-107

Signature:

df_result.coalesce(1).write.parquet(output, mode='overwrite')

Import:

from pyspark.sql import SparkSession

I/O Contract

Inputs:

Parameter Type Required Description
df_result DataFrame Yes The aggregated revenue DataFrame (13 columns: 3 dimensions + 10 measures)
output str Yes File system path for the output Parquet directory (from --output CLI argument)
numPartitions (coalesce) int Yes Number of partitions to reduce to; set to 1 for single-file output
mode str Yes Write mode; set to 'overwrite' for idempotent writes
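
A hedged sketch of how output might arrive from the --output CLI argument mentioned above (the argparse wiring is illustrative, not quoted from the course script):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--output', required=True, help='Path for the output Parquet directory')
args = parser.parse_args()

output = args.output   # passed unchanged to .write.parquet(output, mode='overwrite')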

Outputs:

Output Type Description
Parquet directory Directory on disk Output directory containing one Parquet data file, a _SUCCESS marker, and metadata files
part-00000-*.snappy.parquet File Single Parquet data file with all aggregated revenue rows
_SUCCESS File Empty marker file confirming successful write completion

Usage Examples

Writing aggregated results to Parquet:

df_result.coalesce(1) \
    .write.parquet(output, mode='overwrite')

Writing with a specific output path:

df_result.coalesce(1) \
    .write.parquet('data/report/revenue/', mode='overwrite')

Writing with multiple partitions (alternative for large results):

# If the result is too large for a single partition,
# use more partitions for parallel writing
df_result.coalesce(4) \
    .write.parquet(output, mode='overwrite')

Writing with partitioning by column:

# Alternative: partition the output by service_type
df_result.write.partitionBy('service_type') \
    .parquet(output, mode='overwrite')
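
If the partitioned layout above is used, downstream reads can prune on the partition column; a sketch, assuming a service_type value such as 'green' exists in the data:

# Only the matching service_type=... subdirectories are scanned
df_green = spark.read.parquet(output).filter("service_type = 'green'")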

Verifying the output:

# Read back the written data to verify
df_verify = spark.read.parquet(output)
print(f"Rows written: {df_verify.count()}")
print(f"Columns: {df_verify.columns}")
df_verify.show(5)
