
Principle:Heibaiying BigData Notes Spark Data Loading

From Leeroopedia


Knowledge Sources
Domains Data_Analysis, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

Spark's DataSource API provides a unified, extensible interface for reading structured data from diverse external formats and storage systems into DataFrames.

Description

Loading external data is the foundational step that brings raw information into Spark's distributed computation framework. Spark SQL achieves this through the DataFrameReader API, accessible via spark.read. This API follows a consistent pattern regardless of the underlying data format:

  1. Specify the format (e.g., parquet, json, csv, orc, jdbc, text)
  2. Set format-specific options (e.g., header, inferSchema, delimiter, JDBC URL)
  3. Call load() with the path to the data source

This uniform interface means that switching between data formats requires changing only the format string and relevant options, not the overall program structure. Spark also provides convenience methods (spark.read.json(), spark.read.csv(), spark.read.parquet()) that combine the format specification and load step.
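As a sketch of this uniformity, a small helper (hypothetical, not part of the Spark API) can parameterize the format string while the rest of the call stays fixed:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: only the format string and its options vary
// between formats; the surrounding call structure is identical.
def loadAs(spark: SparkSession, fmt: String, path: String,
           opts: Map[String, String] = Map.empty): DataFrame =
  spark.read.format(fmt).options(opts).load(path)

// Usage (paths are placeholders):
// loadAs(spark, "json", "/data/events.json")
// loadAs(spark, "csv",  "/data/events.csv",
//        Map("header" -> "true", "inferSchema" -> "true"))
```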

The key formats supported natively by Spark include:

Format  | Description                                     | Schema Handling
--------|-------------------------------------------------|------------------------------------------
Parquet | Columnar binary format; Spark's default         | Schema embedded in file metadata
JSON    | Line-delimited JSON records                     | Automatically inferred from data
CSV     | Comma-separated (or custom-delimited) text      | Optionally inferred; header row supported
ORC     | Optimized Row Columnar format (Hive ecosystem)  | Schema embedded in file metadata
JDBC    | Relational database access via a JDBC driver    | Derived from database table metadata
Text    | Plain text, one line per row                    | Single string column named "value"
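To illustrate the schema-handling differences, here is a self-contained sketch (assuming a local SparkSession; the data, paths, and session settings are invented for the example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")   // assumption: local session for illustration
  .appName("schema-handling")
  .getOrCreate()

// Write a tiny Parquet file so the example is self-contained
val dir = java.nio.file.Files.createTempDirectory("spark-demo").toString
spark.range(3).toDF("id").write.parquet(s"$dir/data.parquet")

// Parquet: the schema travels in the file metadata, so no inference pass is needed
spark.read.parquet(s"$dir/data.parquet").printSchema()

// Text: always yields a single string column named "value"
java.nio.file.Files.write(
  java.nio.file.Paths.get(dir, "app.log"),
  "line one\nline two\n".getBytes("UTF-8"))
spark.read.text(s"$dir/app.log").printSchema()
// root
//  |-- value: string (nullable = true)
```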

Usage

Data loading is used whenever an application needs to ingest external data for analysis, transformation, or enrichment. Common scenarios include:

  • Batch ETL pipelines that read daily data drops in Parquet or CSV format
  • Ad-hoc exploration of JSON log files or CSV exports
  • Database integration that pulls relational table data via JDBC for joining with file-based datasets
  • Data lake queries that read from distributed file systems (HDFS, S3, ADLS)
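For the database-integration case, a JDBC read is configured entirely through options. The sketch below uses hypothetical connection details (the URL, table name, and credentials are placeholders, and the matching JDBC driver jar must be on the classpath):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical connection details -- substitute a real URL, table,
// and credentials for your environment.
def readOrders(spark: SparkSession): DataFrame =
  spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/sales")  // placeholder URL
    .option("dbtable", "orders")                         // table (or subquery) to read
    .option("user", "reader")                            // placeholder credentials
    .option("password", "secret")
    .load()
```

The resulting DataFrame can then be joined directly with file-based datasets, e.g. `readOrders(spark).join(fileDf, "order_id")`.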

Theoretical Basis

The DataSource API is built on a pluggable provider architecture. When Spark encounters a format string, it resolves it to a DataSourceRegister implementation that knows how to:

  1. Infer or accept a schema -- either by scanning sample data (for JSON, CSV) or reading embedded metadata (for Parquet, ORC)
  2. Plan physical reads -- by generating scan operators that can push down filters and select only required columns (predicate pushdown and column pruning)
  3. Produce an RDD of InternalRow -- the low-level data representation that feeds into Spark's Catalyst optimizer pipeline

In code, the general loading pattern looks like this:

// General pattern for loading data
val df = spark.read
  .format("csv")                    // specify the data format
  .option("header", "true")         // format-specific option
  .option("inferSchema", "true")    // let Spark deduce column types
  .option("sep", ",")               // field delimiter
  .load("/path/to/data.csv")        // path to the data

// Equivalent convenience method
val dfAlt = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/data.csv")

Schema inference is convenient for exploration but has costs: Spark must scan part of the data to determine types, which adds latency. For production pipelines, explicitly specifying a StructType schema is recommended:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("salary", DoubleType, nullable = true)
))

val df = spark.read
  .schema(schema)
  .csv("/path/to/employees.csv")

Predicate pushdown and column pruning are optimizations where Spark pushes filter conditions and column selections down to the data source layer. For columnar formats like Parquet, this means only the required columns and matching row groups are read from disk, significantly reducing I/O.
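A minimal sketch of both optimizations in action, assuming a local SparkSession (the data and paths are invented for illustration); calling `explain()` surfaces the pushed filters in the physical plan:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")   // assumption: local session for illustration
  .appName("pushdown-demo")
  .getOrCreate()
import spark.implicits._

// Self-contained setup: write a small Parquet file to scan
val dir = java.nio.file.Files.createTempDirectory("pushdown").toString
Seq((1, "alice", 60000.0), (2, "bob", 40000.0))
  .toDF("id", "name", "salary")
  .write.parquet(s"$dir/employees")

val pruned = spark.read.parquet(s"$dir/employees")
  .select("name", "salary")        // column pruning: "id" is never read from disk
  .filter(col("salary") > 50000)   // predicate pushdown to the Parquet scan

// The FileScan node in the physical plan reports the pruned ReadSchema and the
// pushed filters, e.g. PushedFilters: [IsNotNull(salary), GreaterThan(salary,50000.0)]
pruned.explain()
```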
