
Principle:Heibaiying BigData Notes Spark Data Writing

From Leeroopedia


Knowledge Sources
Domains Data_Analysis, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

Spark's DataFrameWriter API provides a unified interface for persisting DataFrame results to external storage systems in multiple formats and with configurable save modes.

Description

After data has been loaded, transformed, aggregated, and joined, the final step in most Spark SQL pipelines is writing the results to an external storage system. The DataFrameWriter API, accessed via df.write, mirrors the DataFrameReader in structure and follows the same consistent pattern:

  1. Specify the format (parquet, json, csv, orc, jdbc, text)
  2. Specify the save mode (what to do if the output already exists)
  3. Set format-specific options
  4. Call save() with the output path (or use a convenience method)

Save Modes

Save modes control how Spark handles pre-existing data at the target location:

  • Append (SaveMode.Append, string "append") -- add new data to the existing data at the target
  • Overwrite (SaveMode.Overwrite, string "overwrite") -- delete all existing data and replace it with the new data
  • ErrorIfExists (SaveMode.ErrorIfExists, string "error" / "errorifexists") -- throw an exception if data already exists (the default)
  • Ignore (SaveMode.Ignore, string "ignore") -- silently do nothing if data already exists

The default save mode is ErrorIfExists, which protects against accidental data overwrites.
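As a sketch of how these modes appear in code (assuming a DataFrame df and a hypothetical output path):

```scala
import org.apache.spark.sql.SaveMode

// Explicit enum constant: throws if /tmp/out already holds data (the default behavior)
df.write.mode(SaveMode.ErrorIfExists).parquet("/tmp/out")

// Equivalent string form; "append" adds new files alongside any existing ones
df.write.mode("append").parquet("/tmp/out")
```

The string and constant forms are interchangeable; the string form is convenient when the mode comes from configuration.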

Output Formats

The DataFrameWriter supports the same formats as the DataFrameReader:

  • Parquet -- the recommended default for Spark; columnar, compressed, schema-preserving
  • JSON -- human-readable, line-delimited JSON records
  • CSV -- widely compatible text format with configurable delimiters and header options
  • ORC -- optimized columnar format for Hive ecosystem compatibility
  • JDBC -- write directly to relational database tables
  • Text -- plain text output (requires a single string column)
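A minimal sketch of a format-specific write, using the CSV options header and sep (the output path is illustrative):

```scala
// Export results as CSV with a header row and a semicolon delimiter
df.write
  .format("csv")
  .mode("overwrite")
  .option("header", "true")
  .option("sep", ";")
  .save("/output/reports/daily")
```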

Usage

Data writing is the final step in analytical pipelines:

  • ETL output -- writing cleaned and transformed data to a data lake in Parquet format
  • Report generation -- exporting aggregated results to CSV for downstream consumption by BI tools
  • Database loading -- writing computed results to a relational database via JDBC
  • Pipeline staging -- writing intermediate results to Parquet for consumption by downstream Spark jobs
  • Data archival -- persisting snapshots with Append mode for historical tracking
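For the database-loading case, a sketch of a JDBC write (connection URL, table name, and credentials here are hypothetical):

```scala
import java.util.Properties

// Connection properties; replace with your own database credentials
val props = new Properties()
props.setProperty("user", "etl_user")
props.setProperty("password", "secret")

// Append computed results to an existing relational table
df.write
  .mode("append")
  .jdbc("jdbc:postgresql://dbhost:5432/analytics", "public.daily_results", props)
```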

Theoretical Basis

The DataFrameWriter mirrors the DataFrameReader's pluggable architecture. When save() is called, Spark:

  1. Evaluates the logical plan -- triggering the full Catalyst optimization pipeline and executing the query
  2. Materializes the results -- each executor task produces the output rows for its partition; rows are written in parallel by the executors, not collected to the driver
  3. Writes through the format provider -- each output format has a corresponding writer implementation that handles serialization and file management

// General pattern for writing data (SaveMode requires an import)
import org.apache.spark.sql.SaveMode

df.write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .option("compression", "snappy")
  .save("/output/path/results.parquet")

// Equivalent convenience method
df.write
  .mode("overwrite")
  .parquet("/output/path/results.parquet")

Partitioning is a critical concept for write performance and downstream query efficiency:

// Partition output by year and month for efficient date-range queries
df.write
  .partitionBy("year", "month")
  .mode(SaveMode.Overwrite)
  .parquet("/output/path/events/")

// This creates a directory structure:
// /output/path/events/year=2025/month=01/part-00000.parquet
// /output/path/events/year=2025/month=02/part-00000.parquet
// ...
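This layout pays off at read time: a filter on the partition columns lets Spark skip entire directories (partition pruning). A sketch, assuming the same events path:

```scala
import org.apache.spark.sql.functions.col

// Only the year=2025/month=02 directory is scanned, not the whole dataset
val feb = spark.read
  .parquet("/output/path/events/")
  .filter(col("year") === 2025 && col("month") === 2)
```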

Bucketing is another optimization that hashes rows into a fixed number of buckets by the bucketed columns (optionally sorting within each bucket), enabling more efficient joins and aggregations on those columns. Bucketed output must be saved as a managed table via saveAsTable rather than save:

// Bucket by userId for efficient joins on that column
df.write
  .bucketBy(16, "userId")
  .sortBy("timestamp")
  .mode(SaveMode.Overwrite)
  .saveAsTable("events_bucketed")
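When two tables are bucketed the same way (same columns, same bucket count), Spark can join them without a full shuffle. A sketch, assuming a second DataFrame otherDf with a matching userId column:

```scala
// Bucket a second table identically so the join keys are pre-partitioned
otherDf.write
  .bucketBy(16, "userId")
  .sortBy("timestamp")
  .mode(SaveMode.Overwrite)
  .saveAsTable("sessions_bucketed")

// Matching bucket layouts let Spark skip the shuffle for this join
val joined = spark.table("events_bucketed")
  .join(spark.table("sessions_bucketed"), "userId")
```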

Key considerations:

  • Overwrite mode deletes the entire target directory before writing; use with caution in production
  • Append mode can create many small files over time, which may degrade read performance (consider periodic compaction)
  • The number of output files is determined by the number of partitions in the DataFrame; use coalesce() or repartition() to control this
  • JDBC writes support batch inserts and truncate options for efficient database loading
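To control the number of output files mentioned above, a minimal sketch (appropriate only for small result sets, since a single partition concentrates the write on one task):

```scala
// Collapse to one output file; coalesce avoids a full shuffle,
// while repartition(n) performs one but balances data evenly
df.coalesce(1)
  .write
  .mode("overwrite")
  .parquet("/output/path/summary/")
```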

Related Pages

Implemented By
