Principle: Heibaiying BigData Notes Spark TempView and SQL
| Knowledge Sources | |
|---|---|
| Domains | Data_Analysis, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Spark SQL allows DataFrames to be registered as named temporary views in the catalog, enabling declarative SQL queries to be mixed freely with the programmatic DataFrame API.
Description
One of Spark SQL's most powerful features is the ability to seamlessly blend two paradigms for data manipulation:
- Programmatic API -- using method calls like select(), filter(), groupBy() on DataFrame objects
- Declarative SQL -- writing standard SQL query strings and executing them through spark.sql()
The bridge between these two worlds is the temporary view. Calling createOrReplaceTempView() on a DataFrame registers it under a name in Spark's in-memory catalog. Once registered, that name can be referenced in SQL statements just like a table name in a relational database.
There are two scopes for temporary views:
| View Type | Scope | Method |
|---|---|---|
| Local temporary view | Visible only within the SparkSession that created it | createOrReplaceTempView(name) |
| Global temporary view | Visible across all SparkSessions within the same Spark application | createOrReplaceGlobalTempView(name) |
Global temporary views are accessed through the global_temp database namespace (e.g., SELECT * FROM global_temp.myView).
Temporary views are not persisted -- they exist only for the lifetime of the SparkSession (or application, for global views). They do not create physical tables or write data to any metastore.
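The scope difference can be sketched as follows -- a minimal, self-contained example assuming a local SparkSession and a hypothetical view name `people`:

```scala
import org.apache.spark.sql.SparkSession

object TempViewScopeDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("tempViewScope").getOrCreate()
  import spark.implicits._

  val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

  // Local temp view: visible only in this SparkSession
  df.createOrReplaceTempView("people")

  // Global temp view: visible to every session in this application,
  // stored in the reserved global_temp database
  df.createOrReplaceGlobalTempView("people")

  // A sibling session cannot see the local view...
  val other = spark.newSession()
  // other.sql("SELECT * FROM people")  // would throw AnalysisException here

  // ...but it can see the global one via the global_temp namespace
  val names = other.sql("SELECT name FROM global_temp.people WHERE age > 26")
    .collect().map(_.getString(0))
  println(names.mkString(","))

  spark.stop()
}
```

Note that neither view survives the application: stopping the SparkSession (or the application, for the global view) discards both.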
Usage
Registering temporary views and executing SQL queries is useful when:
- SQL is more natural for the operation at hand, particularly for complex joins, subqueries, window functions, or CTEs
- Analysts familiar with SQL need to work alongside programmatic Spark code
- Mixing paradigms -- using the DataFrame API for some transformations and SQL for others in the same pipeline
- Prototyping -- quickly exploring data with ad-hoc SQL queries in a notebook
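The mixed style above can be sketched in one short pipeline. This example assumes a hypothetical `orders` dataset and a local session: a DataFrame transformation feeds a SQL window query, whose result is then filtered again through the programmatic API:

```scala
import org.apache.spark.sql.SparkSession

object MixedParadigmsDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("mixedParadigms").getOrCreate()
  import spark.implicits._

  val orders = Seq(("a", 10.0), ("a", 30.0), ("b", 20.0)).toDF("customer", "amount")

  // Programmatic step: drop small orders, then expose the result to SQL
  orders.filter($"amount" >= 15.0).createOrReplaceTempView("big_orders")

  // Declarative step: a window function is often easier to read in SQL
  val ranked = spark.sql("""
    SELECT customer, amount,
           RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk
    FROM big_orders
  """)

  // Back to the API: spark.sql() returned an ordinary DataFrame
  ranked.filter($"rnk" === 1).show()

  spark.stop()
}
```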
Theoretical Basis
Under the hood, spark.sql() parses the SQL string into an Abstract Syntax Tree (AST), resolves table and column references against the catalog, and produces a logical plan that is identical in nature to what the DataFrame API produces. The Catalyst optimizer then applies the same optimization rules regardless of whether the plan originated from SQL text or programmatic API calls.
This means there is no performance difference between equivalent operations expressed via SQL versus the DataFrame API. Both paths converge to the same optimized physical plan.
```scala
// Demonstrating SQL and DataFrame API equivalence
// (in spark-shell, `spark` and the implicits are already in scope)
import org.apache.spark.sql.functions.avg
import spark.implicits._

// Step 1: Create a DataFrame
val empDF = spark.read.json("/data/employees.json")

// Step 2: Register it as a temporary view
empDF.createOrReplaceTempView("employees")

// Step 3a: Query using SQL
val sqlResult = spark.sql("""
  SELECT department, AVG(salary) AS avg_salary
  FROM employees
  WHERE salary > 50000
  GROUP BY department
  ORDER BY avg_salary DESC
""")

// Step 3b: Equivalent DataFrame API
val apiResult = empDF
  .filter($"salary" > 50000)
  .groupBy("department")
  .agg(avg("salary").as("avg_salary"))
  .orderBy($"avg_salary".desc)

// Both sqlResult and apiResult produce the same logical plan
```
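One way to check the convergence claim directly is to compare the optimized logical plans through `queryExecution`. A self-contained sketch, assuming a local session and toy in-memory data rather than the JSON file above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object PlanEquivalenceDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("planEquivalence").getOrCreate()
  import spark.implicits._

  val empDF = Seq(("eng", 80000), ("eng", 60000), ("hr", 55000)).toDF("department", "salary")
  empDF.createOrReplaceTempView("employees")

  val sqlResult = spark.sql(
    "SELECT department, AVG(salary) AS avg_salary FROM employees WHERE salary > 50000 GROUP BY department")
  val apiResult = empDF
    .filter($"salary" > 50000)
    .groupBy("department")
    .agg(avg("salary").as("avg_salary"))

  // sameResult compares canonicalized plans for semantic equality
  println(sqlResult.queryExecution.optimizedPlan.sameResult(apiResult.queryExecution.optimizedPlan))

  // explain(true) prints the parsed, analyzed, optimized, and physical plans for inspection
  sqlResult.explain(true)

  spark.stop()
}
```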
Key considerations:
- createOrReplaceTempView will overwrite any existing view with the same name, which is convenient for iterative development in notebooks
- createTempView (without "OrReplace") will throw an exception if the view already exists, which is safer for production code
- SQL queries return a DataFrame, so results can be further processed using the programmatic API or registered as another view
- Views can reference other views, enabling layered SQL logic
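The considerations above can be sketched in one example -- a hypothetical view name `events`, with `AnalysisException` being what Spark throws when createTempView hits an existing name:

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

object ViewRegistrationDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("viewRegistration").getOrCreate()
  import spark.implicits._

  val df = Seq(1, 2, 3).toDF("id")

  df.createTempView("events")            // first registration succeeds
  try {
    df.createTempView("events")          // second registration: name already taken
  } catch {
    case e: AnalysisException => println(s"rejected: ${e.getMessage}")
  }
  df.createOrReplaceTempView("events")   // silently replaces the existing view

  // SQL returns a DataFrame, so a query result can itself become a view
  spark.sql("SELECT id FROM events WHERE id > 1").createOrReplaceTempView("big_events")
  println(spark.sql("SELECT COUNT(*) AS n FROM big_events").first().getLong(0))

  spark.stop()
}
```

For cleanup during iterative work, `spark.catalog.dropTempView("events")` removes a view from the catalog without touching the underlying DataFrame.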