Implementation:DataTalksClub Data engineering zoomcamp Spark Write Parquet



Page Metadata
Knowledge Sources repo: DataTalksClub/data-engineering-zoomcamp, Spark docs: PySpark API Reference
Domains Data_Engineering, Batch_Processing
Last Updated 2026-02-09 14:00 GMT

Overview

Concrete tool for writing a PySpark DataFrame to Parquet format with single-partition coalescing and overwrite semantics, producing one output file for downstream consumption.

Description

This implementation chains two PySpark operations to write the aggregated revenue results to disk:

  1. df.coalesce(1) -- Reduces the DataFrame to a single partition. Since the aggregated result is relatively small (one row per unique combination of zone, month, and service type), coalescing to one partition is efficient and produces a single output Parquet file rather than many small files.
  2. .write.parquet(output, mode='overwrite') -- Writes the coalesced DataFrame to the specified output path in Parquet format. The overwrite mode ensures that any previously existing data at the output path is replaced, making the pipeline idempotent.
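
A minimal sketch (not quoted from the course script) of the effect of step 1 on partition count; df_result and output stand in for the aggregated DataFrame and output path described in the I/O Contract below:

# Partition count before and after coalescing (values are illustrative)
print(df_result.rdd.getNumPartitions())   # e.g. 200 after an aggregation shuffle
df_single = df_result.coalesce(1)
print(df_single.rdd.getNumPartitions())   # 1

# Step 2: write the single partition as one Parquet file, replacing any prior output
df_single.write.parquet(output, mode='overwrite')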

The output directory will contain:

  • One Parquet data file (e.g., part-00000-*.snappy.parquet)
  • A _SUCCESS marker file indicating successful completion

Note that with default settings (Spark 2.0 and later), the _metadata and _common_metadata summary files are not written; schema information is carried in the footer of the Parquet data file itself.
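
A quick way to confirm this layout after a run, assuming output is a local filesystem path (for HDFS or object storage, use the corresponding filesystem client instead):

import os

# List the files Spark produced in the output directory
print(sorted(os.listdir(output)))
# Expected (exact part-file name varies): ['_SUCCESS', 'part-00000-<id>-c000.snappy.parquet']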

This is a Wrapper Doc implementation wrapping PySpark's DataFrame.coalesce() and DataFrameWriter.parquet() methods.
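
For reference, the same call unchained, which makes the two wrapped methods (and the intermediate DataFrameWriter) explicit; behaviour is identical to the one-liner:

df_single = df_result.coalesce(1)            # DataFrame.coalesce -> DataFrame with one partition
writer = df_single.write.mode('overwrite')   # DataFrame.write -> DataFrameWriter, with save mode set
writer.parquet(output)                       # DataFrameWriter.parquet -> writes the Parquet directory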

Usage

Use this implementation when:

  • Writing aggregated analytical results that are small enough for a single output file
  • The output must be idempotent (safe to overwrite on pipeline reruns)
  • Downstream consumers expect a single Parquet file rather than a partitioned directory
  • Persisting the final stage of a batch ETL pipeline

Code Reference

Source Location: 06-batch/code/06_spark_sql.py, lines 106-107

Signature:

df_result.coalesce(1).write.parquet(output, mode='overwrite')

Import:

from pyspark.sql import SparkSession

I/O Contract

Inputs:

Parameter Type Required Description
df_result DataFrame Yes The aggregated revenue DataFrame (13 columns: 3 dimensions + 10 measures)
output str Yes File system path for the output Parquet directory (from --output CLI argument)
numPartitions (coalesce) int Yes Number of partitions to reduce to; set to 1 for single-file output
mode str Yes Write mode; set to 'overwrite' for idempotent writes
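
A hedged sketch of how output might arrive from the --output CLI argument mentioned above (the argparse wiring is illustrative, not quoted from the course script):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--output', required=True, help='Path for the output Parquet directory')
args = parser.parse_args()

output = args.output   # passed unchanged to .write.parquet(output, mode='overwrite')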

Outputs:

Output Type Description
Parquet directory Directory on disk Output directory containing one Parquet data file, a _SUCCESS marker, and metadata files
part-00000-*.snappy.parquet File Single Parquet data file with all aggregated revenue rows
_SUCCESS File Empty marker file confirming successful write completion

Usage Examples

Writing aggregated results to Parquet:

df_result.coalesce(1) \
    .write.parquet(output, mode='overwrite')

Writing with a specific output path:

df_result.coalesce(1) \
    .write.parquet('data/report/revenue/', mode='overwrite')

Writing with multiple partitions (alternative for large results):

# If the result is too large for a single partition,
# use more partitions for parallel writing
df_result.coalesce(4) \
    .write.parquet(output, mode='overwrite')

Writing with partitioning by column:

# Alternative: partition the output by service_type
df_result.write.partitionBy('service_type') \
    .parquet(output, mode='overwrite')
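
If the partitioned layout above is used, downstream reads can prune on the partition column; a sketch, assuming a service_type value such as 'green' exists in the data:

# Only the matching service_type=... subdirectories are scanned
df_green = spark.read.parquet(output).filter("service_type = 'green'")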

Verifying the output:

# Read back the written data to verify
df_verify = spark.read.parquet(output)
print(f"Rows written: {df_verify.count()}")
print(f"Columns: {df_verify.columns}")
df_verify.show(5)
