Principle: Heibaiying BigData Notes - Spark Data Loading
| Knowledge Sources | |
|---|---|
| Domains | Data_Analysis, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Spark's DataSource API provides a unified, extensible interface for reading structured data from diverse external formats and storage systems into DataFrames.
Description
Loading external data is the foundational step that brings raw information into Spark's distributed computation framework. Spark SQL achieves this through the DataFrameReader API, accessible via spark.read. This API follows a consistent pattern regardless of the underlying data format:
- Specify the format (e.g., parquet, json, csv, orc, jdbc, text)
- Set format-specific options (e.g., header, inferSchema, delimiter, JDBC URL)
- Call load() with the path to the data source
This uniform interface means that switching between data formats requires changing only the format string and relevant options, not the overall program structure. Spark also provides convenience methods (spark.read.json(), spark.read.csv(), spark.read.parquet()) that combine the format specification and load step.
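As a small sketch of this interchangeability, switching a read from Parquet to JSON changes only the format string or convenience method (the paths below are illustrative, not from the original notes):

```scala
// Same reading structure; only the format string and path differ.
// Paths are hypothetical.
val parquetDF = spark.read.format("parquet").load("/data/events/parquet/")
val jsonDF    = spark.read.format("json").load("/data/events/json/")

// Equivalent convenience methods
val parquetDF2 = spark.read.parquet("/data/events/parquet/")
val jsonDF2    = spark.read.json("/data/events/json/")
```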
The key formats supported natively by Spark include:
| Format | Description | Schema Handling |
|---|---|---|
| Parquet | Columnar binary format, default in Spark | Schema embedded in file metadata |
| JSON | Line-delimited JSON records | Automatically inferred from data |
| CSV | Comma-separated (or custom-delimited) text | Optionally inferred; header row supported |
| ORC | Optimized Row Columnar format (Hive ecosystem) | Schema embedded in file metadata |
| JDBC | Relational database access via JDBC driver | Schema derived from database table metadata |
| Text | Plain text, one line per row | Single string column named "value" |
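Illustrating the last row of the table: reading plain text produces a DataFrame with a single string column named "value" (the path is a hypothetical example):

```scala
val textDF = spark.read.text("/var/log/app.log") // hypothetical path
textDF.printSchema()
// root
//  |-- value: string (nullable = true)
```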
Usage
Data loading is used whenever an application needs to ingest external data for analysis, transformation, or enrichment. Common scenarios include:
- Batch ETL pipelines that read daily data drops in Parquet or CSV format
- Ad-hoc exploration of JSON log files or CSV exports
- Database integration that pulls relational table data via JDBC for joining with file-based datasets
- Data lake queries that read from distributed file systems (HDFS, S3, ADLS)
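The JDBC-plus-files scenario above might look like the following sketch; the connection details, table name, and join column are all assumptions for illustration:

```scala
// File-based dataset from the data lake (hypothetical path)
val orders = spark.read.parquet("s3a://lake/orders/")

// Relational table pulled via JDBC (hypothetical connection details)
val customers = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/crm")
  .option("dbtable", "public.customers")
  .option("user", "analyst")
  .option("password", "secret")
  .load()

// Enrich the file-based data with relational attributes
val enriched = orders.join(customers, Seq("customer_id"), "left")
```

In practice, credentials would come from a secrets manager rather than inline options, and large tables would be read with partitioning options to parallelize the JDBC scan.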
Theoretical Basis
The DataSource API is built on a pluggable provider architecture. When Spark encounters a format string, it resolves it to a DataSourceRegister implementation that knows how to:
- Infer or accept a schema -- either by scanning sample data (for JSON, CSV) or reading embedded metadata (for Parquet, ORC)
- Plan physical reads -- by generating scan operators that can push down filters and select only required columns (predicate pushdown and column pruning)
- Produce an RDD of InternalRow -- the low-level data representation that feeds into Spark's Catalyst optimizer pipeline
```scala
// General pattern for loading data
val df = spark.read
  .format("csv")                    // specify the data format
  .option("header", "true")         // first row contains column names
  .option("inferSchema", "true")    // let Spark deduce column types
  .option("sep", ",")               // field delimiter
  .load("/path/to/data.csv")        // path to the data

// Equivalent convenience method
val dfAlt = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/data.csv")
```
Schema inference is convenient for exploration but has costs: Spark must make an extra pass over (a sample of) the data to determine types, which adds latency, and the inferred types can change between runs as the data changes. For production pipelines, explicitly specifying a StructType schema is recommended:
```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("salary", DoubleType, nullable = true)
))

val df = spark.read
  .schema(schema)
  .csv("/path/to/employees.csv")
```
Predicate pushdown and column pruning are optimizations where Spark pushes filter conditions and column selections down to the data source layer. For columnar formats like Parquet, this means only the required columns and matching row groups are read from disk, significantly reducing I/O.
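These optimizations can be observed with explain(), which prints the physical plan; for a Parquet read, the FileScan node lists the pushed filters and the pruned read schema. The file path and column names below are hypothetical:

```scala
import org.apache.spark.sql.functions.col

val employees = spark.read.parquet("/data/employees.parquet") // hypothetical path
employees
  .select("name", "salary")          // column pruning
  .filter(col("salary") > 50000)     // candidate for predicate pushdown
  .explain()
// The FileScan parquet node in the physical plan should report
// PushedFilters such as GreaterThan(salary,50000.0) and a
// ReadSchema restricted to name and salary.
```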