Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Heibaiying BigData Notes Spark TempView and SQL

From Leeroopedia


Knowledge Sources
Domains Data_Analysis, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

Spark SQL allows DataFrames to be registered as named temporary views in the catalog, enabling declarative SQL queries to be mixed freely with the programmatic DataFrame API.

Description

One of Spark SQL's most powerful features is the ability to seamlessly blend two paradigms for data manipulation:

  • Programmatic API -- using method calls like select(), filter(), groupBy() on DataFrame objects
  • Declarative SQL -- writing standard SQL query strings and executing them through spark.sql()

The bridge between these two worlds is the temporary view. By calling createOrReplaceTempView() on a DataFrame, the DataFrame is registered under a name in Spark's in-memory catalog. Once registered, that name can be referenced in SQL statements just like a table name in a relational database.

There are two scopes for temporary views:

View Type Scope Method
Local temporary view Visible only within the SparkSession that created it createOrReplaceTempView(name)
Global temporary view Visible across all SparkSessions within the same Spark application createOrReplaceGlobalTempView(name)

Global temporary views are accessed through the global_temp database namespace (e.g., SELECT * FROM global_temp.myView).

Temporary views are not persisted -- they exist only for the lifetime of the SparkSession (or application, for global views). They do not create physical tables or write data to any metastore.

Usage

Registering temporary views and executing SQL queries is useful when:

  • SQL is more natural for the operation at hand, particularly for complex joins, subqueries, window functions, or CTEs
  • Analysts familiar with SQL need to work alongside programmatic Spark code
  • Mixing paradigms -- using the DataFrame API for some transformations and SQL for others in the same pipeline
  • Prototyping -- quickly exploring data with ad-hoc SQL queries in a notebook

Theoretical Basis

Under the hood, spark.sql() parses the SQL string into an Abstract Syntax Tree (AST), resolves table and column references against the catalog, and produces a logical plan that is identical in nature to what the DataFrame API produces. The Catalyst optimizer then applies the same optimization rules regardless of whether the plan originated from SQL text or programmatic API calls.

This means there is no performance difference between equivalent operations expressed via SQL versus the DataFrame API. Both paths converge to the same optimized physical plan.

// Pseudocode demonstrating the SQL and DataFrame API equivalence

// Step 1: Create a DataFrame
val empDF = spark.read.json("/data/employees.json")

// Step 2: Register it as a temporary view
empDF.createOrReplaceTempView("employees")

// Step 3a: Query using SQL
val sqlResult = spark.sql("""
  SELECT department, AVG(salary) AS avg_salary
  FROM employees
  WHERE salary > 50000
  GROUP BY department
  ORDER BY avg_salary DESC
""")

// Step 3b: Equivalent DataFrame API
val apiResult = empDF
  .filter($"salary" > 50000)
  .groupBy("department")
  .agg(avg("salary").as("avg_salary"))
  .orderBy($"avg_salary".desc)

// Both sqlResult and apiResult produce the same logical plan

Key considerations:

  • createOrReplaceTempView will overwrite any existing view with the same name, which is convenient for iterative development in notebooks
  • createTempView (without "OrReplace") will throw an exception if the view already exists, which is safer for production code
  • SQL queries return a DataFrame, so results can be further processed using the programmatic API or registered as another view
  • Views can reference other views, enabling layered SQL logic

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment