Implementation: Heibaiying BigData Notes Spark Read External Data
| Knowledge Sources | |
|---|---|
| Domains | Data_Analysis, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
A concrete tool, provided by Apache Spark, for loading data from external sources into Spark DataFrames.
Description
The DataFrameReader API, accessed through spark.read, provides both a generic format/load interface and format-specific convenience methods for reading structured data. The BigData-Notes repository documents reading from JSON, CSV, Parquet, ORC, text files, and JDBC databases, with detailed option explanations for each format.
The API supports:
- Generic reading: spark.read.format("json").option(k, v).load(path)
- Convenience methods: spark.read.json(path), spark.read.csv(path), spark.read.parquet(path)
- Schema specification: explicitly providing a StructType or relying on automatic inference
- Multi-path reading: passing multiple paths to load data from several files at once
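The multi-path form listed above can be sketched as follows. This is a minimal, self-contained example: the temporary directory, file names, and sample records are invented for illustration, not taken from the repository.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object MultiPathRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multi-path-read")
      .master("local[*]")
      .getOrCreate()

    // Write two small JSON files to a temporary directory (sample data only).
    val dir = Files.createTempDirectory("multi-path").toFile
    val f1  = new java.io.File(dir, "a.json")
    val f2  = new java.io.File(dir, "b.json")
    Files.write(f1.toPath, """{"name":"alice","age":30}""".getBytes)
    Files.write(f2.toPath, """{"name":"bob","age":25}""".getBytes)

    // A single json() call accepts several paths; the result is one
    // DataFrame containing the rows of every file.
    val df = spark.read.json(f1.getAbsolutePath, f2.getAbsolutePath)
    assert(df.count() == 2)
    df.show()

    spark.stop()
  }
}
```

The same varargs signature applies to csv(), parquet(), orc(), and text(), so a whole directory of files can also be read by passing the directory path itself.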
Usage
Use the DataFrameReader whenever you need to bring external data into Spark for analysis. Choose the format that matches your source data. Use option() to control format-specific behaviors such as header parsing, schema inference, delimiters, and JDBC connection parameters.
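The schema-specification option mentioned above deserves a concrete sketch, since declaring a StructType avoids the extra pass over the data that inferSchema requires and guarantees the column types you expect. The file name and sample rows below are invented for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object ExplicitSchemaRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explicit-schema-read")
      .master("local[*]")
      .getOrCreate()

    // Sample CSV data written to a temp file (contents are illustrative).
    val file = Files.createTempFile("sales", ".csv")
    Files.write(file, "id,amount\n1,9.99\n2,14.50\n".getBytes)

    // Declare the schema up front instead of setting inferSchema to "true".
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("amount", DoubleType)
    ))

    val df = spark.read
      .format("csv")
      .option("header", "true")
      .schema(schema)
      .load(file.toString)

    // The columns carry the declared types, not inferred ones.
    assert(df.schema("amount").dataType == DoubleType)
    df.show()

    spark.stop()
  }
}
```

With an explicit schema, malformed values surface as nulls (or errors, depending on the mode option) rather than silently widening a column to StringType.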
Code Reference
Source Location
- Repository file: notes/SparkSQL外部数据源.md (lines 1-502)
- External class: org.apache.spark.sql.DataFrameReader
- External documentation: DataFrameReader Scaladoc
Signature
// Generic format/load pattern
spark.read
.format(source: String)
.schema(schema: StructType) // optional
.option(key: String, value: String) // repeatable
.load(path: String): DataFrame
// Convenience methods
spark.read.json(paths: String*): DataFrame
spark.read.csv(paths: String*): DataFrame
spark.read.parquet(paths: String*): DataFrame
spark.read.orc(paths: String*): DataFrame
spark.read.text(paths: String*): DataFrame
Import
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| format | String | Yes (generic) | Data source format identifier: "json", "csv", "parquet", "orc", "jdbc", "text" |
| path | String | Yes (file-based) | File system path to the data (local, HDFS, S3) |
| header | String ("true"/"false") | No (CSV) | Whether the first row is a header containing column names |
| inferSchema | String ("true"/"false") | No (CSV) | Whether Spark should infer column data types from the data (JSON schemas are inferred by default) |
| sep | String | No (CSV) | Field delimiter character (default: comma) |
| url | String | Yes (JDBC) | JDBC connection URL (e.g., "jdbc:mysql://host:3306/db") |
| dbtable | String | Yes (JDBC) | Database table name or subquery to read |
| driver | String | No (JDBC) | Fully qualified JDBC driver class name, needed when it cannot be determined from the URL |
| user | String | No (JDBC) | Database username for authentication |
| password | String | No (JDBC) | Database password for authentication |
Outputs
| Name | Type | Description |
|---|---|---|
| DataFrame | org.apache.spark.sql.DataFrame | A distributed collection of rows organized into named columns, representing the loaded data |
Usage Examples
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("Data-Loading-Examples")
.master("local[*]")
.getOrCreate()
// --- Read JSON ---
val jsonDF = spark.read.json("/data/employees.json")
jsonDF.printSchema()
jsonDF.show()
// --- Read CSV with header and schema inference ---
val csvDF = spark.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("sep", ",")
.load("/data/sales.csv")
// --- Read Parquet (Spark's default format) ---
val parquetDF = spark.read.parquet("/data/events.parquet")
// --- Read ORC ---
val orcDF = spark.read.orc("/data/logs.orc")
// --- Read from JDBC (MySQL example) ---
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql://localhost:3306/mydb")
.option("dbtable", "employees")
.option("driver", "com.mysql.cj.jdbc.Driver")
.option("user", "root")
.option("password", "secret")
.load()
// --- Read plain text ---
val textDF = spark.read.text("/data/readme.txt")
textDF.show(truncate = false)