
Principle:Heibaiying BigData Notes Spark Data Writing

From Leeroopedia


Knowledge Sources
Domains Data_Analysis, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

Spark's DataFrameWriter API provides a unified interface for persisting DataFrame results to external storage systems in multiple formats and with configurable save modes.

Description

After data has been loaded, transformed, aggregated, and joined, the final step in most Spark SQL pipelines is writing the results to an external storage system. The DataFrameWriter API, accessed via df.write, mirrors the DataFrameReader in structure and follows the same consistent pattern:

  1. Specify the format (parquet, json, csv, orc, jdbc, text)
  2. Specify the save mode (what to do if the output already exists)
  3. Set format-specific options
  4. Call save() with the output path (or use a convenience method)

Save Modes

Save modes control how Spark handles pre-existing data at the target location:

  • Append (SaveMode.Append, string "append") -- add new data to the existing data at the target
  • Overwrite (SaveMode.Overwrite, string "overwrite") -- delete all existing data and replace it with the new data
  • ErrorIfExists (SaveMode.ErrorIfExists, string "error" / "errorifexists") -- throw an exception if data already exists (the default)
  • Ignore (SaveMode.Ignore, string "ignore") -- silently do nothing if data already exists

The default save mode is ErrorIfExists, which protects against accidental data overwrites.
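As a sketch of how these modes appear in code (assuming a DataFrame df and a hypothetical output path):

```scala
import org.apache.spark.sql.SaveMode

// Explicit enum constant: throws if /tmp/out already holds data (the default behavior)
df.write.mode(SaveMode.ErrorIfExists).parquet("/tmp/out")

// Equivalent string form; "append" adds new files alongside any existing ones
df.write.mode("append").parquet("/tmp/out")
```

The string and constant forms are interchangeable; the string form is convenient when the mode comes from configuration.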

Output Formats

The DataFrameWriter supports the same formats as the DataFrameReader:

  • Parquet -- the recommended default for Spark; columnar, compressed, schema-preserving
  • JSON -- human-readable, line-delimited JSON records
  • CSV -- widely compatible text format with configurable delimiters and header options
  • ORC -- optimized columnar format for Hive ecosystem compatibility
  • JDBC -- write directly to relational database tables
  • Text -- plain text output (requires a single string column)
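A minimal sketch of a format-specific write, using the CSV options header and sep (the output path is illustrative):

```scala
// Export results as CSV with a header row and a semicolon delimiter
df.write
  .format("csv")
  .mode("overwrite")
  .option("header", "true")
  .option("sep", ";")
  .save("/output/reports/daily")
```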

Usage

Data writing is the final step in analytical pipelines:

  • ETL output -- writing cleaned and transformed data to a data lake in Parquet format
  • Report generation -- exporting aggregated results to CSV for downstream consumption by BI tools
  • Database loading -- writing computed results to a relational database via JDBC
  • Pipeline staging -- writing intermediate results to Parquet for consumption by downstream Spark jobs
  • Data archival -- persisting snapshots with Append mode for historical tracking
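For the database-loading case, a sketch of a JDBC write (connection URL, table name, and credentials here are hypothetical):

```scala
import java.util.Properties

// Connection properties; replace with your own database credentials
val props = new Properties()
props.setProperty("user", "etl_user")
props.setProperty("password", "secret")

// Append computed results to an existing relational table
df.write
  .mode("append")
  .jdbc("jdbc:postgresql://dbhost:5432/analytics", "public.daily_results", props)
```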

Theoretical Basis

The DataFrameWriter mirrors the DataFrameReader's pluggable architecture. When save() is called, Spark:

  1. Evaluates the logical plan -- triggering the full Catalyst optimization pipeline and executing the query
  2. Materializes the results -- each executor task produces the output rows for its partition; rows are written in parallel by the executors, not collected to the driver
  3. Writes through the format provider -- each output format has a corresponding writer implementation that handles serialization and file management

// General pattern for writing data (SaveMode requires an import)
import org.apache.spark.sql.SaveMode

df.write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .option("compression", "snappy")
  .save("/output/path/results.parquet")

// Equivalent convenience method
df.write
  .mode("overwrite")
  .parquet("/output/path/results.parquet")

Partitioning is a critical concept for write performance and downstream query efficiency:

// Partition output by year and month for efficient date-range queries
df.write
  .partitionBy("year", "month")
  .mode(SaveMode.Overwrite)
  .parquet("/output/path/events/")

// This creates a directory structure:
// /output/path/events/year=2025/month=01/part-00000.parquet
// /output/path/events/year=2025/month=02/part-00000.parquet
// ...
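This layout pays off at read time: a filter on the partition columns lets Spark skip entire directories (partition pruning). A sketch, assuming the same events path:

```scala
import org.apache.spark.sql.functions.col

// Only the year=2025/month=02 directory is scanned, not the whole dataset
val feb = spark.read
  .parquet("/output/path/events/")
  .filter(col("year") === 2025 && col("month") === 2)
```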

Bucketing is another optimization that hashes rows into a fixed number of buckets by the bucketed columns (optionally sorting within each bucket), enabling more efficient joins and aggregations on those columns. Bucketed output must be saved as a managed table via saveAsTable rather than save:

// Bucket by userId for efficient joins on that column
df.write
  .bucketBy(16, "userId")
  .sortBy("timestamp")
  .mode(SaveMode.Overwrite)
  .saveAsTable("events_bucketed")
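When two tables are bucketed the same way (same columns, same bucket count), Spark can join them without a full shuffle. A sketch, assuming a second DataFrame otherDf with a matching userId column:

```scala
// Bucket a second table identically so the join keys are pre-partitioned
otherDf.write
  .bucketBy(16, "userId")
  .sortBy("timestamp")
  .mode(SaveMode.Overwrite)
  .saveAsTable("sessions_bucketed")

// Matching bucket layouts let Spark skip the shuffle for this join
val joined = spark.table("events_bucketed")
  .join(spark.table("sessions_bucketed"), "userId")
```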

Key considerations:

  • Overwrite mode deletes the entire target directory before writing; use with caution in production
  • Append mode can create many small files over time, which may degrade read performance (consider periodic compaction)
  • The number of output files is determined by the number of partitions in the DataFrame; use coalesce() or repartition() to control this
  • JDBC writes support batch inserts and truncate options for efficient database loading
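To control the number of output files mentioned above, a minimal sketch (appropriate only for small result sets, since a single partition concentrates the write on one task):

```scala
// Collapse to one output file; coalesce avoids a full shuffle,
// while repartition(n) performs one but balances data evenly
df.coalesce(1)
  .write
  .mode("overwrite")
  .parquet("/output/path/summary/")
```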

Related Pages

Implemented By
