Workflow:Heibaiying BigData Notes Spark SQL Data Analysis
| Knowledge Sources | |
|---|---|
| Domains | Big_Data, Data_Analysis, Spark |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
End-to-end process for performing structured data analysis using Spark SQL, from loading external data sources into DataFrames to executing SQL queries, aggregations, and join operations.
Description
This workflow covers the complete data analysis pipeline using Apache Spark's Structured API. It begins with creating a SparkSession as the unified entry point, loading data from various external sources (JSON, CSV, Parquet, ORC, JDBC), performing column operations and DataFrame transformations, registering temporary views for SQL querying, executing aggregation functions and join operations, and writing results to output formats. The workflow leverages Spark's Catalyst optimizer for automatic query optimization.
Usage
Execute this workflow when you need to analyze structured or semi-structured data at scale using SQL-like operations. This is appropriate for batch analytics, data exploration, ETL pipelines, and reporting tasks where data resides in files (JSON, CSV, Parquet) or relational databases accessible via JDBC.
Execution Steps
Step 1: Create SparkSession
Initialize a SparkSession using the builder pattern. Configure the application name, master URL (for local development), and optional settings such as Hive support, warehouse directory, or custom Spark properties.
Key considerations:
- SparkSession is the single entry point replacing older SQLContext and HiveContext
- Use master("local[*]") for local development, omit for cluster submission
- Enable Hive support with enableHiveSupport() to access Hive metastore
- Import spark.implicits._ for implicit conversions in Scala
Step 2: Load Data from External Sources
Read data into DataFrames using the SparkSession read API. Specify the data format (json, csv, parquet, orc, jdbc) and provide format-specific options such as schema inference, header handling, delimiters, or JDBC connection parameters.
What happens:
- spark.read.format("json").load(path) loads JSON files with automatic schema inference
- CSV reading supports header, delimiter, and schema options
- Parquet and ORC are columnar formats with embedded schema
- JDBC reading connects to relational databases with pushdown predicates
- Result is a DataFrame (Dataset[Row]) with inferred or explicit schema
Step 3: Perform DataFrame Transformations
Apply column operations and transformations to the DataFrame using the Structured API. Operations include selecting columns, filtering rows, adding computed columns, renaming columns, sorting, deduplication, and grouping.
Key considerations:
- Use select() to choose or compute columns
- Use filter() or where() for row-level predicates
- Use withColumn() to add or replace columns
- Use groupBy() followed by aggregation functions for grouped analysis
- Use orderBy() or sort() for result ordering
- All operations return new DataFrames (immutable transformation chain)
Step 4: Register Views and Execute SQL Queries
Register DataFrames as temporary views or global temporary views to enable SQL query execution. Write and execute SQL statements against the registered views using the spark.sql() method.
What happens:
- createOrReplaceTempView() registers a session-scoped view
- createGlobalTempView() registers a view accessible across sessions via global_temp database
- spark.sql("SELECT ...") executes SQL and returns a DataFrame
- Full HiveQL syntax is supported including subqueries and window functions
Step 5: Apply Aggregations and Join Operations
Perform advanced analytical operations including aggregation functions (count, sum, avg, min, max, countDistinct), grouped aggregations with multiple aggregate expressions, and join operations (inner, outer, left, right, cross, semi, anti) between DataFrames.
Key considerations:
- Use agg() with multiple aggregation functions for complex grouped analysis
- Join types control how unmatched rows are handled
- Join conditions can use column name equality or complex expressions
- Broadcast joins are automatically applied for small tables
- Window functions enable running aggregations and ranking
Step 6: Write Results to Output
Save the analysis results to external storage using the DataFrame write API. Choose the output format (parquet, json, csv, orc, jdbc), save mode (overwrite, append, ignore, error), and optional partitioning for organized output.
What happens:
- df.write.format("parquet").save(path) writes to Parquet files
- Save modes control behavior when output path exists
- partitionBy() creates directory-based partitioning for efficient reads
- JDBC write persists results to relational databases
- Compression options reduce storage footprint