Implementation:Heibaiying BigData Notes Spark Write External Data

From Leeroopedia


Knowledge Sources
Domains Data_Analysis, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

A concrete tool, provided by Apache Spark, for writing DataFrame results to external storage systems.

Description

The DataFrameWriter API, accessed via df.write, provides methods to persist DataFrame contents in various formats with configurable save modes. The BigData-Notes repository documents the complete write API in the external data sources guide, covering format specification, save mode selection, option configuration, and output path handling.

The API supports:

  • Generic writing: df.write.format("json").mode("overwrite").save(path)
  • Convenience methods: df.write.parquet(path), df.write.json(path), df.write.csv(path)
  • Save modes: Append, Overwrite, ErrorIfExists, Ignore
  • Partitioning: df.write.partitionBy("col").save(path) for directory-based partitioning
  • JDBC writing: df.write.format("jdbc").option(...).save() for database output
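For the built-in file formats, the generic pattern and the convenience methods in the list above are interchangeable; a minimal sketch (paths are illustrative, and df is assumed to be an existing DataFrame):

```scala
// Equivalent ways to write a DataFrame as JSON.
// Generic pattern: name the format explicitly, then call save().
df.write
  .format("json")
  .mode("overwrite")
  .save("/tmp/out/json_generic")

// Convenience method: json(path) is shorthand for format("json").save(path).
df.write
  .mode("overwrite")
  .json("/tmp/out/json_shorthand")
```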

Usage

Use the DataFrameWriter as the final step in a Spark SQL pipeline to persist results. Choose the save mode based on whether you want to append to, overwrite, or protect existing data. Select the output format based on the requirements of downstream consumers (Parquet for Spark pipelines, CSV for BI tools, JDBC for databases).
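The save-mode choice described above can be sketched as follows; paths are illustrative, and the commented-out line shows the failure that the default mode produces on a second write:

```scala
// Default mode is ErrorIfExists: writing to a path that already holds
// data throws an AnalysisException rather than touching it.
df.write.parquet("/tmp/out/results")       // first write succeeds
// df.write.parquet("/tmp/out/results")    // second write would throw

// Overwrite replaces existing data; Ignore silently skips the write
// when data already exists; Append adds new files alongside the old ones.
df.write.mode(SaveMode.Overwrite).parquet("/tmp/out/results")
df.write.mode(SaveMode.Ignore).parquet("/tmp/out/results")   // no-op here
df.write.mode(SaveMode.Append).parquet("/tmp/out/results")
```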

Code Reference

Source Location

  • Repository file: notes/SparkSQL外部数据源.md (lines 1-502)
  • External classes: org.apache.spark.sql.DataFrameWriter, org.apache.spark.sql.SaveMode
  • External documentation: DataFrameWriter Scaladoc

Signature

// Generic format/save pattern
df.write
  .format(source: String)
  .mode(saveMode: SaveMode)             // or mode(saveMode: String)
  .option(key: String, value: String)   // repeatable
  .partitionBy(colNames: String*)       // optional
  .save(path: String): Unit

// Convenience methods
df.write.parquet(path: String): Unit
df.write.json(path: String): Unit
df.write.csv(path: String): Unit
df.write.orc(path: String): Unit
df.write.text(path: String): Unit

// Table output
df.write.saveAsTable(tableName: String): Unit

Import

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode

I/O Contract

Inputs

Name        | Type                    | Required                    | Description
df          | DataFrame               | Yes                         | The DataFrame whose contents will be written
format      | String                  | Yes (generic)               | Output format: "parquet", "json", "csv", "orc", "jdbc", "text"
saveMode    | SaveMode or String      | No (default: ErrorIfExists) | How to handle existing data: Append, Overwrite, ErrorIfExists, Ignore
path        | String                  | Yes (file-based)            | Output file system path (local, HDFS, S3)
partitionBy | String*                 | No                          | Column names to partition the output directory structure by
header      | String ("true"/"false") | No (CSV)                    | Whether to write a header row with column names
sep         | String                  | No (CSV)                    | Field delimiter character for CSV output
compression | String                  | No                          | Compression codec: "snappy", "gzip", "lz4", "none"
url         | String                  | Yes (JDBC)                  | JDBC connection URL for database output
dbtable     | String                  | Yes (JDBC)                  | Target database table name
driver      | String                  | Yes (JDBC)                  | Fully qualified JDBC driver class name
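The compression and CSV options listed above compose with any file-format write; a hedged sketch (path, delimiter, and codec choice are illustrative):

```scala
// Write gzip-compressed CSV with a header row and a custom delimiter.
df.write
  .format("csv")
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .option("sep", ";")
  .option("compression", "gzip")  // "snappy" is the usual choice for parquet/orc
  .save("/tmp/out/results_csv_gz")
```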

Outputs

Name  | Type          | Description
Unit  | Unit          | The write operation returns nothing; it persists data as a side effect to the specified storage system
Files | Files on disk | One or more output files in the specified format at the target path

Usage Examples

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode

val spark = SparkSession.builder()
  .appName("Write-Examples")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Sample DataFrame
val resultDF = Seq(
  (1, "Alice", "Engineering", 90000, 2025),
  (2, "Bob", "Marketing", 75000, 2025),
  (3, "Carol", "Engineering", 95000, 2024),
  (4, "Dave", "Marketing", 70000, 2024)
).toDF("id", "name", "department", "salary", "year")

// --- Write as Parquet (default format, with Overwrite mode) ---
resultDF.write
  .mode(SaveMode.Overwrite)
  .parquet("/output/results.parquet")

// --- Write as JSON ---
resultDF.write
  .mode("overwrite")
  .json("/output/results.json")

// --- Write as CSV with header ---
resultDF.write
  .format("csv")
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .option("sep", ",")
  .save("/output/results.csv")

// --- Write as ORC ---
resultDF.write
  .mode(SaveMode.Overwrite)
  .orc("/output/results.orc")

// --- Write with partitioning ---
resultDF.write
  .partitionBy("year", "department")
  .mode(SaveMode.Overwrite)
  .parquet("/output/partitioned_results/")

// --- Write to JDBC (MySQL example) ---
resultDF.write
  .format("jdbc")
  .mode(SaveMode.Append)
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", "employee_results")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("user", "root")
  .option("password", "secret")
  .save()

// --- Control number of output files ---
resultDF.coalesce(1)   // single output file
  .write
  .mode(SaveMode.Overwrite)
  .json("/output/single_file_result.json")

// --- Append mode (add to existing data) ---
val newData = Seq(
  (5, "Eve", "Sales", 88000, 2025)
).toDF("id", "name", "department", "salary", "year")

newData.write
  .mode(SaveMode.Append)
  .parquet("/output/results.parquet")

// --- Ignore mode (skip if exists) ---
resultDF.write
  .mode(SaveMode.Ignore)
  .parquet("/output/results.parquet")
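The saveAsTable method from the signature section persists the result as a managed table in the session catalog rather than at a caller-supplied path; a minimal sketch (the table name is illustrative, and where the data lands depends on the session's warehouse configuration):

```scala
// --- Save as a managed table in the session catalog ---
resultDF.write
  .mode(SaveMode.Overwrite)
  .saveAsTable("employee_results_tbl")

// The table can then be queried by name through Spark SQL.
spark.sql(
  "SELECT department, avg(salary) FROM employee_results_tbl GROUP BY department"
).show()
```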

Related Pages

Implements Principle

Requires Environment
