Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Heibaiying BigData Notes Spark SQL Data Analysis

From Leeroopedia


Knowledge Sources
Domains Big_Data, Data_Analysis, Spark
Last Updated 2026-02-10 10:00 GMT

Overview

End-to-end process for performing structured data analysis using Spark SQL, from loading external data sources into DataFrames to executing SQL queries, aggregations, and join operations.

Description

This workflow covers the complete data analysis pipeline using Apache Spark's Structured API. It begins with creating a SparkSession as the unified entry point, loading data from various external sources (JSON, CSV, Parquet, ORC, JDBC), performing column operations and DataFrame transformations, registering temporary views for SQL querying, executing aggregation functions and join operations, and writing results to output formats. The workflow leverages Spark's Catalyst optimizer for automatic query optimization.

Usage

Execute this workflow when you need to analyze structured or semi-structured data at scale using SQL-like operations. This is appropriate for batch analytics, data exploration, ETL pipelines, and reporting tasks where data resides in files (JSON, CSV, Parquet) or relational databases accessible via JDBC.

Execution Steps

Step 1: Create SparkSession

Initialize a SparkSession using the builder pattern. Configure the application name, master URL (for local development), and optional settings such as Hive support, warehouse directory, or custom Spark properties.

Key considerations:

  • SparkSession is the single entry point replacing older SQLContext and HiveContext
  • Use master("local[*]") for local development, omit for cluster submission
  • Enable Hive support with enableHiveSupport() to access Hive metastore
  • Import spark.implicits._ for implicit conversions in Scala

Step 2: Load Data from External Sources

Read data into DataFrames using the SparkSession read API. Specify the data format (json, csv, parquet, orc, jdbc) and provide format-specific options such as schema inference, header handling, delimiters, or JDBC connection parameters.

What happens:

  • spark.read.format("json").load(path) loads JSON files with automatic schema inference
  • CSV reading supports header, delimiter, and schema options
  • Parquet and ORC are columnar formats with embedded schema
  • JDBC reading connects to relational databases with pushdown predicates
  • Result is a DataFrame (Dataset[Row]) with inferred or explicit schema

Step 3: Perform DataFrame Transformations

Apply column operations and transformations to the DataFrame using the Structured API. Operations include selecting columns, filtering rows, adding computed columns, renaming columns, sorting, deduplication, and grouping.

Key considerations:

  • Use select() to choose or compute columns
  • Use filter() or where() for row-level predicates
  • Use withColumn() to add or replace columns
  • Use groupBy() followed by aggregation functions for grouped analysis
  • Use orderBy() or sort() for result ordering
  • All operations return new DataFrames (immutable transformation chain)

Step 4: Register Views and Execute SQL Queries

Register DataFrames as temporary views or global temporary views to enable SQL query execution. Write and execute SQL statements against the registered views using the spark.sql() method.

What happens:

  • createOrReplaceTempView() registers a session-scoped view
  • createGlobalTempView() registers a view accessible across sessions via global_temp database
  • spark.sql("SELECT ...") executes SQL and returns a DataFrame
  • Full HiveQL syntax is supported including subqueries and window functions

Step 5: Apply Aggregations and Join Operations

Perform advanced analytical operations including aggregation functions (count, sum, avg, min, max, countDistinct), grouped aggregations with multiple aggregate expressions, and join operations (inner, outer, left, right, cross, semi, anti) between DataFrames.

Key considerations:

  • Use agg() with multiple aggregation functions for complex grouped analysis
  • Join types control how unmatched rows are handled
  • Join conditions can use column name equality or complex expressions
  • Broadcast joins are automatically applied for small tables
  • Window functions enable running aggregations and ranking

Step 6: Write Results to Output

Save the analysis results to external storage using the DataFrame write API. Choose the output format (parquet, json, csv, orc, jdbc), save mode (overwrite, append, ignore, error), and optional partitioning for organized output.

What happens:

  • df.write.format("parquet").save(path) writes to Parquet files
  • Save modes control behavior when output path exists
  • partitionBy() creates directory-based partitioning for efficient reads
  • JDBC write persists results to relational databases
  • Compression options reduce storage footprint

Execution Diagram

GitHub URL

Workflow Repository