Implementation: DataTalksClub Data Engineering Zoomcamp Spark Write Parquet
| Page Metadata | |
|---|---|
| Knowledge Sources | repo: DataTalksClub/data-engineering-zoomcamp, Spark docs: PySpark API Reference |
| Domains | Data_Engineering, Batch_Processing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for writing a PySpark DataFrame to Parquet format with single-partition coalescing and overwrite semantics, producing one output file for downstream consumption.
Description
This implementation chains two PySpark operations to write the aggregated revenue results to disk:
- `df.coalesce(1)` -- Reduces the DataFrame to a single partition. Since the aggregated result is relatively small (one row per unique combination of zone, month, and service type), coalescing to one partition is efficient and produces a single output Parquet file rather than many small files.
- `.write.parquet(output, mode='overwrite')` -- Writes the coalesced DataFrame to the specified output path in Parquet format. The `overwrite` mode ensures that any previously existing data at the output path is replaced, making the pipeline idempotent.
The output directory will contain:
- One Parquet data file (e.g., `part-00000-*.snappy.parquet`)
- A `_SUCCESS` marker file indicating successful completion
- `_metadata` and `_common_metadata` summary files with schema information, but only if Parquet summary-metadata writing is enabled (Spark 2.0+ leaves it disabled by default)
This is a Wrapper Doc implementation wrapping PySpark's DataFrame.coalesce() and DataFrameWriter.parquet() methods.
Usage
Use this implementation when:
- Writing aggregated analytical results that are small enough for a single output file
- The output must be idempotent (safe to overwrite on pipeline reruns)
- Downstream consumers expect a single Parquet file rather than a partitioned directory
- Persisting the final stage of a batch ETL pipeline
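Since the output path arrives via a `--output` CLI argument (see the I/O contract below), the wiring can be sketched with stdlib `argparse`. Only the `--output` flag is taken from the source; the course script accepts additional flags not shown here:

```python
import argparse

# Minimal CLI wiring for the Parquet output path
parser = argparse.ArgumentParser()
parser.add_argument('--output', required=True,
                    help='directory for the Parquet output')

# Simulate invoking the script as: 06_spark_sql.py --output data/report/revenue/
args = parser.parse_args(['--output', 'data/report/revenue/'])
print(args.output)  # data/report/revenue/
```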
Code Reference
Source Location: 06-batch/code/06_spark_sql.py, lines 106-107
Signature:
df.coalesce(1).write.parquet(output, mode='overwrite')
Import:
from pyspark.sql import SparkSession
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| df_result | DataFrame | Yes | The aggregated revenue DataFrame (13 columns: 3 dimensions + 10 measures) |
| output | str | Yes | File system path for the output Parquet directory (from --output CLI argument) |
| numPartitions (coalesce) | int | Yes | Number of partitions to reduce to; set to 1 for single-file output |
| mode | str | No | Write mode; defaults to 'error'. Set to 'overwrite' for idempotent writes |
Outputs:
| Output | Type | Description |
|---|---|---|
| Parquet directory | Directory on disk | Output directory containing one Parquet data file and a _SUCCESS marker |
| part-00000-*.snappy.parquet | File | Single Parquet data file with all aggregated revenue rows |
| _SUCCESS | File | Empty marker file confirming successful write completion |
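The directory contract above can be checked without Spark using the stdlib. In this sketch the directory layout is simulated with empty files purely for illustration; the same `glob`/`os` checks apply to a real Spark output directory:

```python
import glob
import os
import tempfile

# Simulate the layout Spark produces for a single-partition write
outdir = tempfile.mkdtemp()
open(os.path.join(outdir, "part-00000-abc123.snappy.parquet"), "wb").close()
open(os.path.join(outdir, "_SUCCESS"), "w").close()

# Exactly one data file matching the part-* pattern, plus the marker
part_files = glob.glob(os.path.join(outdir, "part-*.snappy.parquet"))
success = os.path.exists(os.path.join(outdir, "_SUCCESS"))

print(len(part_files), success)  # 1 True
```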
Usage Examples
Writing aggregated results to Parquet:
df_result.coalesce(1) \
.write.parquet(output, mode='overwrite')
Writing with a specific output path:
df_result.coalesce(1) \
.write.parquet('data/report/revenue/', mode='overwrite')
Writing with multiple partitions (alternative for large results):
# If the result is too large for a single partition, keep a few
# partitions for parallel writing (note: coalesce can only reduce
# the partition count; use repartition(n) to increase it)
df_result.coalesce(4) \
.write.parquet(output, mode='overwrite')
Writing with partitioning by column:
# Alternative: partition the output by service_type
df_result.write.partitionBy('service_type') \
.parquet(output, mode='overwrite')
Verifying the output:
# Read back the written data to verify
df_verify = spark.read.parquet(output)
print(f"Rows written: {df_verify.count()}")
print(f"Columns: {df_verify.columns}")
df_verify.show(5)