Implementation: Heibaiying BigData Notes Spark Write External Data
| Knowledge Sources | |
|---|---|
| Domains | Data_Analysis, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Concrete tool, provided by Apache Spark, for writing DataFrame results to external storage systems.
Description
The DataFrameWriter API, accessed via df.write, provides methods to persist DataFrame contents in various formats with configurable save modes. The BigData-Notes repository documents the complete write API in the external data sources guide, covering format specification, save mode selection, option configuration, and output path handling.
The API supports:
- Generic writing: df.write.format("json").mode("overwrite").save(path) (a minimal sketch follows this list)
- Convenience methods: df.write.parquet(path), df.write.json(path), df.write.csv(path)
- Save modes: Append, Overwrite, ErrorIfExists, Ignore
- Partitioning: df.write.partitionBy("col").save(path) for directory-based partitioning
- JDBC writing: df.write.format("jdbc").option(...).save() for database output
Usage
Use the DataFrameWriter as the final step in a Spark SQL pipeline to persist results. Choose the save mode based on whether you want to append to, overwrite, or protect existing data. Select the output format based on the requirements of downstream consumers (Parquet for Spark pipelines, CSV for BI tools, JDBC for databases).
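As a hedged illustration of that guidance, the sketch below ends a small pipeline with CSV output for a BI consumer; it reuses the SparkSession spark from the sketch above, and the input path, output path, and column names are hypothetical.
// Final pipeline step: read, aggregate, then persist for a downstream BI tool.
val orders = spark.read
  .option("header", "true")
  .csv("/data/orders.csv")
val totals = orders
  .groupBy("region")
  .count()
totals.write
  .mode(SaveMode.Overwrite)    // replace any previous export
  .option("header", "true")    // BI tools usually expect a header row
  .csv("/exports/region_totals")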
Code Reference
Source Location
- Repository file: notes/SparkSQL外部数据源.md (lines 1-502)
- External classes: org.apache.spark.sql.DataFrameWriter, org.apache.spark.sql.SaveMode
- External documentation: DataFrameWriter Scaladoc
Signature
// Generic format/save pattern
df.write
.format(source: String)
.mode(saveMode: SaveMode) // or mode(saveMode: String)
.option(key: String, value: String) // repeatable
.partitionBy(colNames: String*) // optional
.save(path: String): Unit
// Convenience methods
df.write.parquet(path: String): Unit
df.write.json(path: String): Unit
df.write.csv(path: String): Unit
df.write.orc(path: String): Unit
df.write.text(path: String): Unit
// Table output
df.write.saveAsTable(tableName: String): Unit
Import
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| df | DataFrame | Yes | The DataFrame whose contents will be written |
| format | String | Yes (generic) | Output format: "parquet", "json", "csv", "orc", "jdbc", "text" |
| saveMode | SaveMode or String | No (default: ErrorIfExists) | How to handle existing data: Append, Overwrite, ErrorIfExists, Ignore |
| path | String | Yes (file-based) | Output file system path (local, HDFS, S3) |
| partitionBy | String* | No | Column names to partition the output directory structure by |
| header | String ("true"/"false") | No (CSV) | Whether to write a header row with column names |
| sep | String | No (CSV) | Field delimiter character for CSV output |
| compression | String | No | Compression codec: "snappy", "gzip", "lz4", "none" |
| url | String | Yes (JDBC) | JDBC connection URL for database output |
| dbtable | String | Yes (JDBC) | Target database table name |
| driver | String | Yes (JDBC) | Fully qualified JDBC driver class name |
Outputs
| Name | Type | Description |
|---|---|---|
| Unit | Unit | The write operation returns nothing; it persists data as a side effect to the specified storage system |
| Files | Files on disk | One or more output files in the specified format at the target path |
Usage Examples
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode
val spark = SparkSession.builder()
.appName("Write-Examples")
.master("local[*]")
.getOrCreate()
import spark.implicits._
// Sample DataFrame
val resultDF = Seq(
(1, "Alice", "Engineering", 90000, 2025),
(2, "Bob", "Marketing", 75000, 2025),
(3, "Carol", "Engineering", 95000, 2024),
(4, "Dave", "Marketing", 70000, 2024)
).toDF("id", "name", "department", "salary", "year")
// --- Write as Parquet (default format, with Overwrite mode) ---
resultDF.write
.mode(SaveMode.Overwrite)
.parquet("/output/results.parquet")
// --- Write as JSON ---
resultDF.write
.mode("overwrite")
.json("/output/results.json")
// --- Write as CSV with header ---
resultDF.write
.format("csv")
.mode(SaveMode.Overwrite)
.option("header", "true")
.option("sep", ",")
.save("/output/results.csv")
// --- Write as ORC ---
resultDF.write
.mode(SaveMode.Overwrite)
.orc("/output/results.orc")
// --- Write with partitioning ---
resultDF.write
.partitionBy("year", "department")
.mode(SaveMode.Overwrite)
.parquet("/output/partitioned_results/")
// --- Write to JDBC (MySQL example) ---
resultDF.write
.format("jdbc")
.mode(SaveMode.Append)
.option("url", "jdbc:mysql://localhost:3306/mydb")
.option("dbtable", "employee_results")
.option("driver", "com.mysql.cj.jdbc.Driver")
.option("user", "root")
.option("password", "secret")
.save()
// --- Control number of output files ---
resultDF.coalesce(1) // single output file
.write
.mode(SaveMode.Overwrite)
.json("/output/single_file_result.json")
// --- Append mode (add to existing data) ---
val newData = Seq(
(5, "Eve", "Sales", 88000, 2025)
).toDF("id", "name", "department", "salary", "year")
newData.write
.mode(SaveMode.Append)
.parquet("/output/results.parquet")
// --- Ignore mode (skip if exists) ---
resultDF.write
.mode(SaveMode.Ignore)
.parquet("/output/results.parquet")