
Principle: DataTalksClub Data Engineering Zoomcamp Parquet Output Writing

From Leeroopedia


Page Metadata
Knowledge Sources DataTalksClub Data Engineering Zoomcamp
Domains Data_Engineering, Batch_Processing
Last Updated 2026-02-09 14:00 GMT

Overview

Parquet output writing is the process of persisting analytical results to a columnar storage format with explicit control over partitioning and write semantics.

Description

After batch processing completes its transformations and aggregations, the results must be written to persistent storage. Writing to Parquet format preserves the benefits of columnar storage for any downstream consumers of the data: efficient reads, built-in schema metadata, and strong compression.

Two critical decisions govern the output writing step:

  • Partition control (coalescing) -- In a distributed framework, the result data is naturally spread across multiple partitions (one per executor task). Writing directly would produce many small files, one per partition. Coalescing to a single partition before writing forces all data through one writer, producing exactly one output file. This is appropriate when the result set is small enough to fit in a single partition and a single output file simplifies downstream consumption (e.g., loading into a database or dashboard tool).
  • Write mode -- The write mode determines behavior when the output path already exists:
    • Overwrite replaces the existing data entirely, ensuring idempotent pipeline execution.
    • Append adds new data alongside existing data.
    • Error/ErrorIfExists (the default in many frameworks) raises an exception if the output already exists.
    • Ignore silently skips writing if the output already exists.

For batch ETL pipelines, overwrite mode is the standard choice because it ensures that rerunning the pipeline produces the same result regardless of previous executions (idempotency).

Usage

Use parquet output writing when:

  • Persisting aggregated or transformed results for downstream consumption
  • The output format should support efficient analytical reads
  • The pipeline should be idempotent (safe to rerun without producing duplicate data)
  • Controlling the number of output files is important for downstream tooling
  • Schema metadata should be embedded in the output for self-describing data

Theoretical Basis

The output writing process with partition control can be modeled as:

FUNCTION write_results(dataset, output_path, num_partitions, write_mode):
    -- Step 1: Control the number of output partitions
    IF num_partitions < current_partition_count(dataset):
        dataset = coalesce(dataset, num_partitions)
        -- Note: coalesce reduces partitions WITHOUT a full shuffle
        -- (it merges existing partitions rather than redistributing data)

    -- Step 2: Determine write behavior
    IF write_mode == "overwrite":
        delete_if_exists(output_path)
    ELSE IF write_mode == "error" AND exists(output_path):
        RAISE FileExistsError
    ELSE IF write_mode == "ignore" AND exists(output_path):
        RETURN output_path  -- silently skip the write
    -- (append needs no branch: new part files land beside existing ones)

    -- Step 3: Write data in columnar format
    FOR EACH partition IN dataset.partitions:
        file_path = output_path + "/part-" + partition.index + ".parquet"
        write_parquet_file(file_path, partition.data)

    -- Step 4: Write success marker
    write_file(output_path + "/_SUCCESS", empty)

    RETURN output_path

write_results(aggregated_data, "/output/revenue/", num_partitions=1, write_mode="overwrite")

The distinction between coalesce and repartition is important: coalesce reduces the number of partitions by merging adjacent partitions without a full data shuffle across the network. This is more efficient than repartition when reducing partition count, but it may result in uneven partition sizes. Since the goal here is to produce a single output file from an already-small aggregated dataset, coalesce is the appropriate choice.
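The merge-without-shuffle behavior can be illustrated with a toy coalesce over lists of records. This is a sketch of the partition-merging idea only; real frameworks track partition locality across executors, which plain lists cannot show:

```python
def coalesce(partitions: list, n: int) -> list:
    """Reduce to at most n partitions by concatenating contiguous runs
    of whole input partitions -- no record is split off or redistributed,
    which is why no shuffle is needed."""
    if n >= len(partitions):
        return partitions
    base, extra = divmod(len(partitions), n)
    out, i = [], 0
    for k in range(n):
        size = base + (1 if k < extra else 0)  # runs may be uneven
        merged = []
        for part in partitions[i:i + size]:
            merged.extend(part)
        out.append(merged)
        i += size
    return out
```

Note that the resulting partitions can be uneven, which is the trade-off mentioned above; a repartition would instead redistribute every record across the network to balance them.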

The _SUCCESS marker file written after all data files are complete is a convention used by distributed frameworks to signal that the write operation completed successfully. Downstream readers can check for this file to confirm data integrity.
