
Implementation:Heibaiying BigData Notes Spark Read External Data

From Leeroopedia


Knowledge Sources
Domains: Data_Analysis, Big_Data
Last Updated: 2026-02-10 10:00 GMT

Overview

A concrete tool, provided by Apache Spark, for loading data from external sources into Spark DataFrames.

Description

The DataFrameReader API, accessed through spark.read, provides both a generic format/load interface and format-specific convenience methods for reading structured data. The BigData-Notes repository documents reading from JSON, CSV, Parquet, ORC, text files, and JDBC databases, with detailed option explanations for each format.

The API supports:

  • Generic reading: spark.read.format("json").option(k, v).load(path)
  • Convenience methods: spark.read.json(path), spark.read.csv(path), spark.read.parquet(path)
  • Schema specification: explicitly providing a StructType or relying on automatic inference
  • Multi-path reading: passing multiple paths to load data from several files at once
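The last two bullets, explicit schema specification and multi-path reading, can be sketched together. This is a minimal sketch assuming Spark is on the classpath; the temp-directory setup, file names, and schema fields are illustrative, not taken from the repository:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("Schema-And-MultiPath")
  .master("local[*]")
  .getOrCreate()

// Create two small CSV files so the example is self-contained.
val dir = Files.createTempDirectory("sales")
Files.write(dir.resolve("sales-2023.csv"), "id,region,amount\n1,east,100\n".getBytes)
Files.write(dir.resolve("sales-2024.csv"), "id,region,amount\n2,west,250\n".getBytes)

// Explicit schema: skips the inference pass over the data and pins
// the column types, so they stay stable across runs.
val salesSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("region", StringType, nullable = true),
  StructField("amount", IntegerType, nullable = true)
))

// Multi-path reading: both files land in a single DataFrame.
val sales = spark.read
  .schema(salesSchema)
  .option("header", "true")
  .csv(dir.resolve("sales-2023.csv").toString, dir.resolve("sales-2024.csv").toString)

sales.show()
```

Passing the schema up front is also the usual defensive choice in production jobs, since inference can silently change types when the data drifts.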

Usage

Use the DataFrameReader whenever you need to bring external data into Spark for analysis. Choose the format that matches your source data. Use option() to control format-specific behaviors such as header parsing, schema inference, delimiters, and JDBC connection parameters.
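One such option is mode, which governs how malformed records are handled when parsing CSV and JSON: PERMISSIVE (the default) keeps bad rows with nulls, DROPMALFORMED discards them, and FAILFAST aborts on the first bad record. A minimal sketch assuming Spark is on the classpath; the sample file is generated on the fly rather than taken from the repository:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("Read-Mode-Sketch")
  .master("local[*]")
  .getOrCreate()

// A CSV file with one malformed row: "two" is not an Int and the row
// carries an extra column.
val dir = Files.createTempDirectory("mode-demo")
val file = dir.resolve("mixed.csv")
Files.write(file, "id,amount\n1,100\ntwo,oops,extra\n3,300\n".getBytes)

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("amount", IntegerType, nullable = true)
))

// DROPMALFORMED silently discards rows that do not match the schema.
val clean = spark.read
  .schema(schema)
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .csv(file.toString)

clean.show()
```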

Code Reference

Source Location

  • Repository file: notes/SparkSQL外部数据源.md ("Spark SQL External Data Sources", lines 1-502)
  • External class: org.apache.spark.sql.DataFrameReader
  • External documentation: DataFrameReader Scaladoc

Signature

// Generic format/load pattern
spark.read
  .format(source: String)
  .schema(schema: StructType)          // optional
  .option(key: String, value: String)  // repeatable
  .load(path: String): DataFrame

// Convenience methods
spark.read.json(paths: String*): DataFrame
spark.read.csv(paths: String*): DataFrame
spark.read.parquet(paths: String*): DataFrame
spark.read.orc(paths: String*): DataFrame
spark.read.text(paths: String*): DataFrame
spark.read.jdbc(url: String, table: String, properties: java.util.Properties): DataFrame

Import

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

I/O Contract

Inputs

Name        | Type                      | Required        | Description
format      | String                    | Yes (generic)   | Data source format identifier: "json", "csv", "parquet", "orc", "jdbc", "text"
path        | String                    | Yes (file-based)| File system path to the data (local, HDFS, S3)
header      | String ("true"/"false")   | No (CSV)        | Whether the first row is a header containing column names
inferSchema | String ("true"/"false")   | No (CSV/JSON)   | Whether Spark should infer column data types from the data
sep         | String                    | No (CSV)        | Field delimiter character (default: comma)
url         | String                    | Yes (JDBC)      | JDBC connection URL (e.g., "jdbc:mysql://host:3306/db")
dbtable     | String                    | Yes (JDBC)      | Database table name or subquery to read
driver      | String                    | Yes (JDBC)      | Fully qualified JDBC driver class name
user        | String                    | No (JDBC)       | Database username for authentication
password    | String                    | No (JDBC)       | Database password for authentication

Outputs

Name      | Type                            | Description
DataFrame | org.apache.spark.sql.DataFrame  | A distributed collection of rows organized into named columns, representing the loaded data

Usage Examples

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Data-Loading-Examples")
  .master("local[*]")
  .getOrCreate()

// --- Read JSON ---
val jsonDF = spark.read.json("/data/employees.json")
jsonDF.printSchema()
jsonDF.show()

// --- Read CSV with header and schema inference ---
val csvDF = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("sep", ",")
  .load("/data/sales.csv")

// --- Read Parquet (Spark's default format) ---
val parquetDF = spark.read.parquet("/data/events.parquet")

// --- Read ORC ---
val orcDF = spark.read.orc("/data/logs.orc")

// --- Read from JDBC (MySQL example) ---
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", "employees")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("user", "root")
  .option("password", "secret")
  .load()

// --- Read plain text ---
val textDF = spark.read.text("/data/readme.txt")
textDF.show(truncate = false)

Related Pages

Implements Principle

Requires Environment
