Implementation: Heibaiying BigData-Notes Spark SQL View Registration
| Knowledge Sources | |
|---|---|
| Domains | Data_Analysis, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Concrete Apache Spark facility for registering DataFrames as temporary views and executing SQL queries against them.
Description
The createOrReplaceTempView() method registers a DataFrame as a named view in Spark's in-memory catalog, and spark.sql() executes a SQL string against that catalog, returning the result as a DataFrame. Together, these two methods enable the full power of declarative SQL within a Spark application.
The BigData-Notes repository demonstrates this pattern in the Structured API guide (registering views from programmatically created DataFrames) and in the aggregation functions guide (using SQL for aggregate computations on registered views).
Usage
Use createOrReplaceTempView() to expose a DataFrame as a SQL-queryable table. Use spark.sql() to run SQL statements against registered views. This is particularly valuable for complex analytical queries where SQL syntax (especially window functions, CTEs, and subqueries) is more expressive than the programmatic API.
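To illustrate the expressiveness point above, the following sketch combines a CTE with a filtering outer query against a registered view. The view name `employees` and the sample rows mirror the data used in the Usage Examples section; they are illustrative, not part of the repository's code.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative standalone sketch: a small employees view with (name, deptId, salary).
val spark = SparkSession.builder().appName("CTE-Sketch").master("local[*]").getOrCreate()
import spark.implicits._

Seq(("Alice", 10, 90000), ("Bob", 20, 75000), ("Carol", 10, 95000))
  .toDF("name", "deptId", "salary")
  .createOrReplaceTempView("employees")

// The CTE computes per-department averages; the outer query keeps
// only employees earning strictly above their department's average.
val aboveAvg = spark.sql("""
  WITH dept_avg AS (
    SELECT deptId, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY deptId
  )
  SELECT e.name, e.salary, d.avg_salary
  FROM employees e
  JOIN dept_avg d ON e.deptId = d.deptId
  WHERE e.salary > d.avg_salary
""")
aboveAvg.show()
```

Expressing the same logic with the programmatic API would require a separate aggregated DataFrame and an explicit self-join; the CTE keeps the whole computation in one readable statement.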
Code Reference
Source Location
- Repository files:
  - notes/Spark_Structured_API的基本使用.md (approximately line 35 for view registration)
  - notes/SparkSQL常用聚合函数.md (approximately line 35 for SQL-based aggregations)
- External classes:
  - org.apache.spark.sql.SparkSession
  - org.apache.spark.sql.DataFrame
- External documentation: createOrReplaceTempView Scaladoc
Signature
// Register a DataFrame as a session-scoped temporary view
df.createOrReplaceTempView(viewName: String): Unit
df.createTempView(viewName: String): Unit        // throws AnalysisException if the name is taken
// Register an application-scoped global view (queried via the global_temp database)
df.createOrReplaceGlobalTempView(viewName: String): Unit
df.createGlobalTempView(viewName: String): Unit  // throws AnalysisException if the name is taken
// Execute a SQL query against registered views
spark.sql(sqlText: String): DataFrame
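The signatures above list both session-scoped and global variants. A session-scoped temporary view disappears with its SparkSession, while a global temporary view lives in the reserved `global_temp` database and is visible to every session of the same application. A minimal sketch (the view names `salaries` and `salaries_global` are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("GlobalView-Sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 90000), (2, 75000)).toDF("empId", "salary")

// Session-scoped: visible only through this SparkSession.
df.createOrReplaceTempView("salaries")
val sessionCount = spark.sql("SELECT COUNT(*) AS c FROM salaries")

// Application-scoped: must be qualified with the global_temp database.
df.createOrReplaceGlobalTempView("salaries_global")
val globalCount = spark.sql("SELECT COUNT(*) AS c FROM global_temp.salaries_global")

// A new session on the same application still sees the global view...
val other = spark.newSession()
val otherCount = other.sql("SELECT COUNT(*) AS c FROM global_temp.salaries_global")
// ...but querying the session-scoped `salaries` view from `other` would fail.
```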
Import
import org.apache.spark.sql.SparkSession
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| viewName | String | Yes | The name under which the DataFrame will be registered in the catalog |
| sqlText | String | Yes (for spark.sql) | A valid Spark SQL query string referencing registered view names |
Outputs
| Name | Type | Description |
|---|---|---|
| Unit | Unit | createOrReplaceTempView returns nothing; it registers the view as a side effect |
| DataFrame | org.apache.spark.sql.DataFrame | spark.sql() returns a DataFrame containing the query results |
Usage Examples
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("SQL-View-Examples")
.master("local[*]")
.getOrCreate()
import spark.implicits._
// Create sample DataFrames
val empDF = Seq(
(1, "Alice", 10, 90000),
(2, "Bob", 20, 75000),
(3, "Carol", 10, 95000),
(4, "Dave", 20, 70000),
(5, "Eve", 10, 88000)
).toDF("empId", "name", "deptId", "salary")
val deptDF = Seq(
(10, "Engineering"),
(20, "Marketing"),
(30, "Sales")
).toDF("deptId", "deptName")
// Register as temporary views
empDF.createOrReplaceTempView("employees")
deptDF.createOrReplaceTempView("departments")
// --- Simple SQL query ---
spark.sql("SELECT * FROM employees WHERE salary > 80000").show()
// --- Join using SQL ---
val joinResult = spark.sql("""
SELECT e.name, d.deptName, e.salary
FROM employees e
JOIN departments d ON e.deptId = d.deptId
ORDER BY e.salary DESC
""")
joinResult.show()
// --- Aggregation using SQL ---
val aggResult = spark.sql("""
SELECT d.deptName,
COUNT(*) AS emp_count,
AVG(e.salary) AS avg_salary,
MAX(e.salary) AS max_salary
FROM employees e
JOIN departments d ON e.deptId = d.deptId
GROUP BY d.deptName
""")
aggResult.show()
// --- Window function using SQL ---
val windowResult = spark.sql("""
SELECT name, salary, deptId,
RANK() OVER (PARTITION BY deptId ORDER BY salary DESC) AS rank
FROM employees
""")
windowResult.show()
// --- The result of spark.sql() is a DataFrame, so further API calls work ---
joinResult
.filter($"salary" > 80000)
.select("name", "deptName")
.show()
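Registered views can also be inspected and removed through the SparkSession catalog API (`spark.catalog.listTables`, `tableExists`, and `dropTempView` are standard catalog methods; the view name below is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Catalog-Sketch").master("local[*]").getOrCreate()
import spark.implicits._

Seq((1, "Alice")).toDF("empId", "name").createOrReplaceTempView("employees")

// List the tables and views currently registered in the session catalog.
spark.catalog.listTables().show()

// tableExists and dropTempView manage a view's lifecycle explicitly.
val existed = spark.catalog.tableExists("employees")  // true while the view is registered
spark.catalog.dropTempView("employees")               // returns true if a view was dropped
val remains = spark.catalog.tableExists("employees")  // false after the drop
```

Dropping a view only removes the catalog entry; any DataFrame previously returned by `spark.sql()` against it remains valid.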