Implementation:Heibaiying BigData Notes Spark Write External Data

From Leeroopedia


Knowledge Sources
Domains Data_Analysis, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

A concrete tool, provided by Apache Spark, for writing DataFrame results to external storage systems.

Description

The DataFrameWriter API, accessed via df.write, provides methods to persist DataFrame contents in various formats with configurable save modes. The BigData-Notes repository documents the complete write API in the external data sources guide, covering format specification, save mode selection, option configuration, and output path handling.

The API supports:

  • Generic writing: df.write.format("json").mode("overwrite").save(path)
  • Convenience methods: df.write.parquet(path), df.write.json(path), df.write.csv(path)
  • Save modes: Append, Overwrite, ErrorIfExists, Ignore
  • Partitioning: df.write.partitionBy("col").save(path) for directory-based partitioning
  • JDBC writing: df.write.format("jdbc").option(...).save() for database output
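For the built-in file formats, the generic pattern and the convenience methods in the list above are interchangeable; a minimal sketch (paths are illustrative, and df is assumed to be an existing DataFrame):

```scala
// Equivalent ways to write a DataFrame as JSON.
// Generic pattern: name the format explicitly, then call save().
df.write
  .format("json")
  .mode("overwrite")
  .save("/tmp/out/json_generic")

// Convenience method: json(path) is shorthand for format("json").save(path).
df.write
  .mode("overwrite")
  .json("/tmp/out/json_shorthand")
```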

Usage

Use the DataFrameWriter as the final step in a Spark SQL pipeline to persist results. Choose the save mode based on whether you want to append to, overwrite, or protect existing data. Select the output format based on the requirements of downstream consumers (Parquet for Spark pipelines, CSV for BI tools, JDBC for databases).
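The save-mode choice described above can be sketched as follows; paths are illustrative, and the commented-out line shows the failure that the default mode produces on a second write:

```scala
// Default mode is ErrorIfExists: writing to a path that already holds
// data throws an AnalysisException rather than touching it.
df.write.parquet("/tmp/out/results")       // first write succeeds
// df.write.parquet("/tmp/out/results")    // second write would throw

// Overwrite replaces existing data; Ignore silently skips the write
// when data already exists; Append adds new files alongside the old ones.
df.write.mode(SaveMode.Overwrite).parquet("/tmp/out/results")
df.write.mode(SaveMode.Ignore).parquet("/tmp/out/results")   // no-op here
df.write.mode(SaveMode.Append).parquet("/tmp/out/results")
```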

Code Reference

Source Location

  • Repository file: notes/SparkSQL外部数据源.md (lines 1-502)
  • External classes: org.apache.spark.sql.DataFrameWriter, org.apache.spark.sql.SaveMode
  • External documentation: DataFrameWriter Scaladoc

Signature

// Generic format/save pattern
df.write
  .format(source: String)
  .mode(saveMode: SaveMode)             // or mode(saveMode: String)
  .option(key: String, value: String)   // repeatable
  .partitionBy(colNames: String*)       // optional
  .save(path: String): Unit

// Convenience methods
df.write.parquet(path: String): Unit
df.write.json(path: String): Unit
df.write.csv(path: String): Unit
df.write.orc(path: String): Unit
df.write.text(path: String): Unit

// Table output
df.write.saveAsTable(tableName: String): Unit

Import

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode

I/O Contract

Inputs

Name        | Type                    | Required                    | Description
df          | DataFrame               | Yes                         | The DataFrame whose contents will be written
format      | String                  | Yes (generic)               | Output format: "parquet", "json", "csv", "orc", "jdbc", "text"
saveMode    | SaveMode or String      | No (default: ErrorIfExists) | How to handle existing data: Append, Overwrite, ErrorIfExists, Ignore
path        | String                  | Yes (file-based)            | Output file system path (local, HDFS, S3)
partitionBy | String*                 | No                          | Column names to partition the output directory structure by
header      | String ("true"/"false") | No (CSV)                    | Whether to write a header row with column names
sep         | String                  | No (CSV)                    | Field delimiter character for CSV output
compression | String                  | No                          | Compression codec: "snappy", "gzip", "lz4", "none"
url         | String                  | Yes (JDBC)                  | JDBC connection URL for database output
dbtable     | String                  | Yes (JDBC)                  | Target database table name
driver      | String                  | Yes (JDBC)                  | Fully qualified JDBC driver class name
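The compression and CSV options listed above compose with any file-format write; a hedged sketch (path, delimiter, and codec choice are illustrative):

```scala
// Write gzip-compressed CSV with a header row and a custom delimiter.
df.write
  .format("csv")
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .option("sep", ";")
  .option("compression", "gzip")  // "snappy" is the usual choice for parquet/orc
  .save("/tmp/out/results_csv_gz")
```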

Outputs

Name  | Type          | Description
Unit  | Unit          | The write operation returns nothing; it persists data as a side effect to the specified storage system
Files | Files on disk | One or more output files in the specified format at the target path

Usage Examples

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode

val spark = SparkSession.builder()
  .appName("Write-Examples")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Sample DataFrame
val resultDF = Seq(
  (1, "Alice", "Engineering", 90000, 2025),
  (2, "Bob", "Marketing", 75000, 2025),
  (3, "Carol", "Engineering", 95000, 2024),
  (4, "Dave", "Marketing", 70000, 2024)
).toDF("id", "name", "department", "salary", "year")

// --- Write as Parquet (default format, with Overwrite mode) ---
resultDF.write
  .mode(SaveMode.Overwrite)
  .parquet("/output/results.parquet")

// --- Write as JSON ---
resultDF.write
  .mode("overwrite")
  .json("/output/results.json")

// --- Write as CSV with header ---
resultDF.write
  .format("csv")
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .option("sep", ",")
  .save("/output/results.csv")

// --- Write as ORC ---
resultDF.write
  .mode(SaveMode.Overwrite)
  .orc("/output/results.orc")

// --- Write with partitioning ---
resultDF.write
  .partitionBy("year", "department")
  .mode(SaveMode.Overwrite)
  .parquet("/output/partitioned_results/")

// --- Write to JDBC (MySQL example) ---
resultDF.write
  .format("jdbc")
  .mode(SaveMode.Append)
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", "employee_results")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("user", "root")
  .option("password", "secret")
  .save()

// --- Control number of output files ---
resultDF.coalesce(1)   // single output file
  .write
  .mode(SaveMode.Overwrite)
  .json("/output/single_file_result.json")

// --- Append mode (add to existing data) ---
val newData = Seq(
  (5, "Eve", "Sales", 88000, 2025)
).toDF("id", "name", "department", "salary", "year")

newData.write
  .mode(SaveMode.Append)
  .parquet("/output/results.parquet")

// --- Ignore mode (skip if exists) ---
resultDF.write
  .mode(SaveMode.Ignore)
  .parquet("/output/results.parquet")
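The saveAsTable method from the signature section persists the result as a managed table in the session catalog rather than at a caller-supplied path; a minimal sketch (the table name is illustrative, and where the data lands depends on the session's warehouse configuration):

```scala
// --- Save as a managed table in the session catalog ---
resultDF.write
  .mode(SaveMode.Overwrite)
  .saveAsTable("employee_results_tbl")

// The table can then be queried by name through Spark SQL.
spark.sql(
  "SELECT department, avg(salary) FROM employee_results_tbl GROUP BY department"
).show()
```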

Related Pages

Implements Principle

Requires Environment
