Principle: Heibaiying BigData Notes - Spark Data Writing
| Knowledge Sources | |
|---|---|
| Domains | Data_Analysis, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Spark's DataFrameWriter API provides a unified interface for persisting DataFrame results to external storage systems in multiple formats and with configurable save modes.
Description
After data has been loaded, transformed, aggregated, and joined, the final step in most Spark SQL pipelines is writing the results to an external storage system. The DataFrameWriter API, accessed via df.write, mirrors the DataFrameReader and follows the same four-step pattern:
- Specify the format (parquet, json, csv, orc, jdbc, text)
- Specify the save mode (what to do if the output already exists)
- Set format-specific options
- Call save() with the output path (or use a convenience method)
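The four steps above map directly onto the fluent API. A minimal sketch (the path, delimiter, and an existing DataFrame `df` are assumptions for illustration):

```scala
// Hypothetical example: each chained call corresponds to one step of the pattern
df.write
  .format("csv")              // 1. specify the format
  .mode("overwrite")          // 2. specify the save mode
  .option("header", "true")   // 3. set format-specific options
  .option("sep", "|")
  .save("/tmp/output/report") // 4. call save() with the output path
```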
Save Modes
Save modes control how Spark handles pre-existing data at the target location:
| Save Mode | Scala Constant | String Value | Behavior |
|---|---|---|---|
| Append | SaveMode.Append | "append" | Add new data to existing data at the target |
| Overwrite | SaveMode.Overwrite | "overwrite" | Delete all existing data and replace with new data |
| ErrorIfExists | SaveMode.ErrorIfExists | "error" / "errorifexists" | Throw an exception if data already exists (default) |
| Ignore | SaveMode.Ignore | "ignore" | Silently do nothing if data already exists |
The default save mode is ErrorIfExists, which protects against accidental data overwrites.
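Save modes can be set either with the SaveMode enum or its string equivalent. A short sketch (the paths are illustrative):

```scala
import org.apache.spark.sql.SaveMode

// Append new rows to an existing dataset using the Scala constant
df.write.mode(SaveMode.Append).parquet("/data/events")

// Equivalent using the string form
df.write.mode("append").parquet("/data/events")

// With no mode set, the default ErrorIfExists applies:
// this throws an AnalysisException if /data/events already exists
df.write.parquet("/data/events")
```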
Output Formats
The DataFrameWriter supports the same formats as the DataFrameReader:
- Parquet -- the recommended default for Spark; columnar, compressed, schema-preserving
- JSON -- human-readable, line-delimited JSON records
- CSV -- widely compatible text format with configurable delimiters and header options
- ORC -- optimized columnar format for Hive ecosystem compatibility
- JDBC -- write directly to relational database tables
- Text -- plain text output (requires a single string column)
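As one example of a non-file format, a JDBC write targets a database table rather than a path. A sketch, assuming a MySQL target; the URL, table name, and credentials are placeholders, not from the source:

```scala
import java.util.Properties

// Hypothetical connection details for a relational database target
val connProps = new Properties()
connProps.setProperty("user", "etl_user")
connProps.setProperty("password", "secret")
connProps.setProperty("driver", "com.mysql.cj.jdbc.Driver")

// Append the DataFrame's rows to the daily_metrics table
df.write
  .mode("append")
  .jdbc("jdbc:mysql://db-host:3306/warehouse", "daily_metrics", connProps)
```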
Usage
Data writing is the final step in analytical pipelines:
- ETL output -- writing cleaned and transformed data to a data lake in Parquet format
- Report generation -- exporting aggregated results to CSV for downstream consumption by BI tools
- Database loading -- writing computed results to a relational database via JDBC
- Pipeline staging -- writing intermediate results to Parquet for consumption by downstream Spark jobs
- Data archival -- persisting snapshots with Append mode for historical tracking
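The report-generation case often combines a CSV write with coalesce(1) so BI tools receive a single file. A sketch; `summaryDf` and the output path are assumptions:

```scala
// Export an aggregated result as one CSV file with a header row
summaryDf
  .coalesce(1)                // collapse to one partition -> one output file
  .write
  .mode("overwrite")
  .option("header", "true")
  .csv("/reports/monthly_revenue")
```

coalesce(1) forces all data through a single task, so this pattern suits small aggregated results, not raw datasets.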
Theoretical Basis
The DataFrameWriter mirrors the DataFrameReader's pluggable architecture. When save() is called, Spark:
- Evaluates the logical plan -- triggering the full Catalyst optimization pipeline and executing the query
- Distributes the write -- each executor task serializes and writes its own partition of the output in parallel; rows are not collected to the driver
- Writes through the format provider -- each output format has a corresponding writer implementation that handles serialization and file management
```scala
// General pattern for writing data
df.write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .option("compression", "snappy")
  .save("/output/path/results.parquet")

// Equivalent convenience method
df.write
  .mode("overwrite")
  .parquet("/output/path/results.parquet")
```
Partitioning is a critical concept for write performance and downstream query efficiency:
```scala
// Partition output by year and month for efficient date-range queries
df.write
  .partitionBy("year", "month")
  .mode(SaveMode.Overwrite)
  .parquet("/output/path/events/")

// This creates a directory structure:
// /output/path/events/year=2025/month=01/part-00000.parquet
// /output/path/events/year=2025/month=02/part-00000.parquet
// ...
```
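The payoff comes at read time: a filter on the partition columns lets Spark scan only the matching year=/month= directories (partition pruning). A sketch, assuming an active SparkSession named `spark`:

```scala
import org.apache.spark.sql.functions.col

// Only the year=2025/month=1 directory is read; all other partitions are skipped
val janEvents = spark.read
  .parquet("/output/path/events/")
  .filter(col("year") === 2025 && col("month") === 1)
```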
Bucketing is another optimization that hashes rows into a fixed number of buckets (optionally sorted within each bucket), enabling more efficient joins and aggregations on the bucketed columns. Note that bucketBy only works with saveAsTable, which registers the output in the metastore; it cannot be combined with a path-based save():
```scala
// Bucket by userId for efficient joins on that column
df.write
  .bucketBy(16, "userId")
  .sortBy("timestamp")
  .mode(SaveMode.Overwrite)
  .saveAsTable("events_bucketed")
```
Key considerations:
- Overwrite mode deletes the entire target directory before writing; use with caution in production
- Append mode can create many small files over time, which may degrade read performance (consider periodic compaction)
- The number of output files is determined by the number of partitions in the DataFrame; use coalesce() or repartition() to control this
- JDBC writes support batch inserts and truncate options for efficient database loading
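The last two considerations can be sketched together (the paths, bucket count, and a predefined JDBC url and connProps are assumptions for illustration):

```scala
// Control output file count: coalesce(8) merges partitions without a full
// shuffle; repartition(8) shuffles but can also increase parallelism
df.coalesce(8).write.mode("overwrite").parquet("/output/compact")

// JDBC tuning: truncate the table on overwrite instead of dropping it,
// and insert rows in large batches
df.write
  .mode("overwrite")
  .option("truncate", "true")   // TRUNCATE TABLE rather than DROP + CREATE
  .option("batchsize", "10000") // rows per JDBC batch insert
  .jdbc(url, "results_table", connProps)
```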