
Principle:Heibaiying BigData Notes Spark Data Loading

From Leeroopedia


Knowledge Sources
Domains Data_Analysis, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

Spark's DataSource API provides a unified, extensible interface for reading structured data from diverse external formats and storage systems into DataFrames.

Description

Loading external data is the foundational step that brings raw information into Spark's distributed computation framework. Spark SQL achieves this through the DataFrameReader API, accessible via spark.read. This API follows a consistent pattern regardless of the underlying data format:

  1. Specify the format (e.g., parquet, json, csv, orc, jdbc, text)
  2. Set format-specific options (e.g., header, inferSchema, delimiter, JDBC URL)
  3. Call load() with the path to the data source

This uniform interface means that switching between data formats requires changing only the format string and relevant options, not the overall program structure. Spark also provides convenience methods (spark.read.json(), spark.read.csv(), spark.read.parquet()) that combine the format specification and load step.
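As a sketch of this uniformity, a small helper (hypothetical, not part of the Spark API) can parameterize the format string while the rest of the call stays fixed:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: only the format string and its options vary
// between formats; the surrounding call structure is identical.
def loadAs(spark: SparkSession, fmt: String, path: String,
           opts: Map[String, String] = Map.empty): DataFrame =
  spark.read.format(fmt).options(opts).load(path)

// Usage (paths are placeholders):
// loadAs(spark, "json", "/data/events.json")
// loadAs(spark, "csv",  "/data/events.csv",
//        Map("header" -> "true", "inferSchema" -> "true"))
```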

The key formats supported natively by Spark include:

Format  | Description                                     | Schema Handling
--------|-------------------------------------------------|------------------------------------------
Parquet | Columnar binary format; Spark's default         | Schema embedded in file metadata
JSON    | Line-delimited JSON records                     | Automatically inferred from data
CSV     | Comma-separated (or custom-delimited) text      | Optionally inferred; header row supported
ORC     | Optimized Row Columnar format (Hive ecosystem)  | Schema embedded in file metadata
JDBC    | Relational database access via a JDBC driver    | Derived from database table metadata
Text    | Plain text, one line per row                    | Single string column named "value"
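To illustrate the schema-handling differences, here is a self-contained sketch (assuming a local SparkSession; the data, paths, and session settings are invented for the example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")   // assumption: local session for illustration
  .appName("schema-handling")
  .getOrCreate()

// Write a tiny Parquet file so the example is self-contained
val dir = java.nio.file.Files.createTempDirectory("spark-demo").toString
spark.range(3).toDF("id").write.parquet(s"$dir/data.parquet")

// Parquet: the schema travels in the file metadata, so no inference pass is needed
spark.read.parquet(s"$dir/data.parquet").printSchema()

// Text: always yields a single string column named "value"
java.nio.file.Files.write(
  java.nio.file.Paths.get(dir, "app.log"),
  "line one\nline two\n".getBytes("UTF-8"))
spark.read.text(s"$dir/app.log").printSchema()
// root
//  |-- value: string (nullable = true)
```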

Usage

Data loading is used whenever an application needs to ingest external data for analysis, transformation, or enrichment. Common scenarios include:

  • Batch ETL pipelines that read daily data drops in Parquet or CSV format
  • Ad-hoc exploration of JSON log files or CSV exports
  • Database integration that pulls relational table data via JDBC for joining with file-based datasets
  • Data lake queries that read from distributed file systems (HDFS, S3, ADLS)
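For the database-integration case, a JDBC read is configured entirely through options. The sketch below uses hypothetical connection details (the URL, table name, and credentials are placeholders, and the matching JDBC driver jar must be on the classpath):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical connection details -- substitute a real URL, table,
// and credentials for your environment.
def readOrders(spark: SparkSession): DataFrame =
  spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/sales")  // placeholder URL
    .option("dbtable", "orders")                         // table (or subquery) to read
    .option("user", "reader")                            // placeholder credentials
    .option("password", "secret")
    .load()
```

The resulting DataFrame can then be joined directly with file-based datasets, e.g. `readOrders(spark).join(fileDf, "order_id")`.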

Theoretical Basis

The DataSource API is built on a pluggable provider architecture. When Spark encounters a format string, it resolves it to a DataSourceRegister implementation that knows how to:

  1. Infer or accept a schema -- either by scanning sample data (for JSON, CSV) or reading embedded metadata (for Parquet, ORC)
  2. Plan physical reads -- by generating scan operators that can push down filters and select only required columns (predicate pushdown and column pruning)
  3. Produce an RDD of InternalRow -- the low-level data representation that feeds into Spark's Catalyst optimizer pipeline

In code, the general loading pattern looks like this:

// General pattern for loading data
val df = spark.read
  .format("csv")                    // specify the data format
  .option("header", "true")         // format-specific option
  .option("inferSchema", "true")    // let Spark deduce column types
  .option("sep", ",")               // field delimiter
  .load("/path/to/data.csv")        // path to the data

// Equivalent convenience method
val dfAlt = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/data.csv")

Schema inference is convenient for exploration but has costs: Spark must scan part of the data to determine types, which adds latency. For production pipelines, explicitly specifying a StructType schema is recommended:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("salary", DoubleType, nullable = true)
))

val df = spark.read
  .schema(schema)
  .csv("/path/to/employees.csv")

Predicate pushdown and column pruning are optimizations where Spark pushes filter conditions and column selections down to the data source layer. For columnar formats like Parquet, this means only the required columns and matching row groups are read from disk, significantly reducing I/O.
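A minimal sketch of both optimizations in action, assuming a local SparkSession (the data and paths are invented for illustration); calling `explain()` surfaces the pushed filters in the physical plan:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")   // assumption: local session for illustration
  .appName("pushdown-demo")
  .getOrCreate()
import spark.implicits._

// Self-contained setup: write a small Parquet file to scan
val dir = java.nio.file.Files.createTempDirectory("pushdown").toString
Seq((1, "alice", 60000.0), (2, "bob", 40000.0))
  .toDF("id", "name", "salary")
  .write.parquet(s"$dir/employees")

val pruned = spark.read.parquet(s"$dir/employees")
  .select("name", "salary")        // column pruning: "id" is never read from disk
  .filter(col("salary") > 50000)   // predicate pushdown to the Parquet scan

// The FileScan node in the physical plan reports the pruned ReadSchema and the
// pushed filters, e.g. PushedFilters: [IsNotNull(salary), GreaterThan(salary,50000.0)]
pruned.explain()
```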
