Principle: Heibaiying BigData Notes Spark TempView and SQL
| Knowledge Sources | |
|---|---|
| Domains | Data_Analysis, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Spark SQL allows DataFrames to be registered as named temporary views in the catalog, enabling declarative SQL queries to be mixed freely with the programmatic DataFrame API.
Description
One of Spark SQL's most powerful features is the ability to seamlessly blend two paradigms for data manipulation:
- Programmatic API -- using method calls like select(), filter(), groupBy() on DataFrame objects
- Declarative SQL -- writing standard SQL query strings and executing them through spark.sql()
The bridge between these two worlds is the temporary view. Calling createOrReplaceTempView() on a DataFrame registers it under a name in Spark's in-memory catalog. Once registered, that name can be referenced in SQL statements just like a table name in a relational database.
There are two scopes for temporary views:
| View Type | Scope | Method |
|---|---|---|
| Local temporary view | Visible only within the SparkSession that created it | createOrReplaceTempView(name) |
| Global temporary view | Visible across all SparkSessions within the same Spark application | createOrReplaceGlobalTempView(name) |
Global temporary views are accessed through the global_temp database namespace (e.g., SELECT * FROM global_temp.myView).
Temporary views are not persisted -- they exist only for the lifetime of the SparkSession (or application, for global views). They do not create physical tables or write data to any metastore.
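The scope difference can be sketched as follows -- a minimal, self-contained example assuming a local SparkSession and a hypothetical view name `people`:

```scala
import org.apache.spark.sql.SparkSession

object TempViewScopeDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("tempViewScope").getOrCreate()
  import spark.implicits._

  val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

  // Local temp view: visible only in this SparkSession
  df.createOrReplaceTempView("people")

  // Global temp view: visible to every session in this application,
  // stored in the reserved global_temp database
  df.createOrReplaceGlobalTempView("people")

  // A sibling session cannot see the local view...
  val other = spark.newSession()
  // other.sql("SELECT * FROM people")  // would throw AnalysisException here

  // ...but it can see the global one via the global_temp namespace
  val names = other.sql("SELECT name FROM global_temp.people WHERE age > 26")
    .collect().map(_.getString(0))
  println(names.mkString(","))

  spark.stop()
}
```

Note that neither view survives the application: stopping the SparkSession (or the application, for the global view) discards both.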
Usage
Registering temporary views and executing SQL queries is useful when:
- SQL is more natural for the operation at hand, particularly for complex joins, subqueries, window functions, or CTEs
- Analysts familiar with SQL need to work alongside programmatic Spark code
- Mixing paradigms -- using the DataFrame API for some transformations and SQL for others in the same pipeline
- Prototyping -- quickly exploring data with ad-hoc SQL queries in a notebook
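The mixed style above can be sketched in one short pipeline. This example assumes a hypothetical `orders` dataset and a local session: a DataFrame transformation feeds a SQL window query, whose result is then filtered again through the programmatic API:

```scala
import org.apache.spark.sql.SparkSession

object MixedParadigmsDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("mixedParadigms").getOrCreate()
  import spark.implicits._

  val orders = Seq(("a", 10.0), ("a", 30.0), ("b", 20.0)).toDF("customer", "amount")

  // Programmatic step: drop small orders, then expose the result to SQL
  orders.filter($"amount" >= 15.0).createOrReplaceTempView("big_orders")

  // Declarative step: a window function is often easier to read in SQL
  val ranked = spark.sql("""
    SELECT customer, amount,
           RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk
    FROM big_orders
  """)

  // Back to the API: spark.sql() returned an ordinary DataFrame
  ranked.filter($"rnk" === 1).show()

  spark.stop()
}
```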
Theoretical Basis
Under the hood, spark.sql() parses the SQL string into an Abstract Syntax Tree (AST), resolves table and column references against the catalog, and produces a logical plan that is identical in nature to what the DataFrame API produces. The Catalyst optimizer then applies the same optimization rules regardless of whether the plan originated from SQL text or programmatic API calls.
This means there is no performance difference between equivalent operations expressed via SQL versus the DataFrame API. Both paths converge to the same optimized physical plan.
```scala
// Demonstrating SQL and DataFrame API equivalence
// (in spark-shell, `spark` and the implicits are already in scope)
import org.apache.spark.sql.functions.avg
import spark.implicits._

// Step 1: Create a DataFrame
val empDF = spark.read.json("/data/employees.json")

// Step 2: Register it as a temporary view
empDF.createOrReplaceTempView("employees")

// Step 3a: Query using SQL
val sqlResult = spark.sql("""
  SELECT department, AVG(salary) AS avg_salary
  FROM employees
  WHERE salary > 50000
  GROUP BY department
  ORDER BY avg_salary DESC
""")

// Step 3b: Equivalent DataFrame API
val apiResult = empDF
  .filter($"salary" > 50000)
  .groupBy("department")
  .agg(avg("salary").as("avg_salary"))
  .orderBy($"avg_salary".desc)

// Both sqlResult and apiResult produce the same logical plan
```
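One way to check the convergence claim directly is to compare the optimized logical plans through `queryExecution`. A self-contained sketch, assuming a local session and toy in-memory data rather than the JSON file above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object PlanEquivalenceDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("planEquivalence").getOrCreate()
  import spark.implicits._

  val empDF = Seq(("eng", 80000), ("eng", 60000), ("hr", 55000)).toDF("department", "salary")
  empDF.createOrReplaceTempView("employees")

  val sqlResult = spark.sql(
    "SELECT department, AVG(salary) AS avg_salary FROM employees WHERE salary > 50000 GROUP BY department")
  val apiResult = empDF
    .filter($"salary" > 50000)
    .groupBy("department")
    .agg(avg("salary").as("avg_salary"))

  // sameResult compares canonicalized plans for semantic equality
  println(sqlResult.queryExecution.optimizedPlan.sameResult(apiResult.queryExecution.optimizedPlan))

  // explain(true) prints the parsed, analyzed, optimized, and physical plans for inspection
  sqlResult.explain(true)

  spark.stop()
}
```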
Key considerations:
- createOrReplaceTempView will overwrite any existing view with the same name, which is convenient for iterative development in notebooks
- createTempView (without "OrReplace") will throw an exception if the view already exists, which is safer for production code
- SQL queries return a DataFrame, so results can be further processed using the programmatic API or registered as another view
- Views can reference other views, enabling layered SQL logic
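The considerations above can be sketched in one example -- a hypothetical view name `events`, with `AnalysisException` being what Spark throws when createTempView hits an existing name:

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

object ViewRegistrationDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("viewRegistration").getOrCreate()
  import spark.implicits._

  val df = Seq(1, 2, 3).toDF("id")

  df.createTempView("events")            // first registration succeeds
  try {
    df.createTempView("events")          // second registration: name already taken
  } catch {
    case e: AnalysisException => println(s"rejected: ${e.getMessage}")
  }
  df.createOrReplaceTempView("events")   // silently replaces the existing view

  // SQL returns a DataFrame, so a query result can itself become a view
  spark.sql("SELECT id FROM events WHERE id > 1").createOrReplaceTempView("big_events")
  println(spark.sql("SELECT COUNT(*) AS n FROM big_events").first().getLong(0))

  spark.stop()
}
```

For cleanup during iterative work, `spark.catalog.dropTempView("events")` removes a view from the catalog without touching the underlying DataFrame.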