Principle:Haifengl Smile SQL Query Execution

Overview

SQL Query Execution is the principle of running SQL statements against loaded data to perform aggregations, joins, filtering, window functions, and other analytical computations. SQL provides a declarative interface for complex data transformations: the user specifies what result is desired, and the query engine determines how to compute it efficiently.

In the Smile library, query execution is the core analytical operation that transforms raw tabular data into meaningful insights, summary statistics, and feature sets suitable for machine learning.

Theoretical Basis

Relational Algebra Foundations

SQL is a declarative language built on relational algebra, a formal system for manipulating sets of tuples (rows). The fundamental operations are:

Operation	SQL Equivalent	Symbol	Description
Selection	`WHERE`	sigma	Filter rows based on a predicate.
Projection	`SELECT col1, col2`	pi	Select specific columns.
Cartesian Product	`CROSS JOIN`	x	Combine every row of one table with every row of another.
Union	`UNION`	U	Combine rows from two tables with the same schema.
Difference	`EXCEPT`	-	Rows in one table but not in another.
Rename	`AS`	rho	Rename columns or tables.

Joins are derived from selection applied to Cartesian products. An equi-join (A JOIN B ON A.id = B.id) is equivalent to sigma(A x B) where the predicate tests equality of the join keys.

Query Optimization

The query optimizer transforms a declarative SQL statement into an efficient execution plan. Key optimization strategies include:

Predicate pushdown: Move WHERE filters as close to the data source as possible, reducing the volume of data processed.
Projection pushdown: Only read columns that are referenced in the query, avoiding unnecessary I/O.
Join reordering: Determine the optimal order to join multiple tables, minimizing intermediate result sizes.
Hash aggregation: Use hash tables for GROUP BY operations instead of sorting, which is O(n) vs O(n log n).
Vectorized execution: Process data in batches (vectors) rather than row-by-row, exploiting CPU cache locality and SIMD instructions.

DuckDB's optimizer is particularly effective for OLAP queries because of its columnar vectorized execution engine, which processes data in batches of 2048 values at a time, achieving near-native performance for analytical workloads.

Three Categories of SQL Statements

SQL statements are categorized by their purpose:

Category	Statements	Returns	Use Case
DQL (Data Query Language)	`SELECT`	Result set (rows and columns)	Reading and analyzing data.
DML (Data Manipulation Language)	`INSERT`, `UPDATE`, `DELETE`	Row count (number affected)	Modifying data in tables.
DDL (Data Definition Language)	`CREATE`, `DROP`, `ALTER`	Boolean success indicator	Defining or modifying table structure.

Each category requires a different execution method. Smile's SQL class provides three corresponding methods: query() for DQL, update() for DML, and execute() for general statements.

Analytical SQL Patterns

Modern analytical databases support advanced SQL features beyond basic CRUD:

Window Functions enable calculations across related rows without collapsing them into groups:

SELECT name, department, salary,
       RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees

Common Table Expressions (CTEs) enable modular query composition:

WITH monthly_sales AS (
    SELECT date_trunc('month', sale_date) AS month, SUM(amount) AS total
    FROM sales GROUP BY month
)
SELECT month, total, LAG(total) OVER (ORDER BY month) AS prev_month
FROM monthly_sales

GROUPING SETS, ROLLUP, CUBE enable multi-level aggregation in a single pass:

SELECT region, product, SUM(revenue)
FROM sales
GROUP BY GROUPING SETS ((region), (product), (region, product), ())

Relevance to Machine Learning

SQL query execution serves as the feature engineering layer in ML pipelines:

Aggregation queries compute summary statistics (mean, variance, count) that become features for models.
Join queries combine data from multiple sources to create enriched feature vectors.
Window functions compute lag features, rolling averages, and rankings that capture temporal or ordinal patterns.
Filtering queries select training subsets, remove outliers, or split data into train/test sets.

The result of a query is a DataFrame, which serves as the bridge between SQL-based data preparation and numerical ML algorithms.

Knowledge Sources

Domains

Data_Engineering, SQL, Analytics, Feature_Engineering

Implementation:Haifengl_Smile_SQL_Query_API - The concrete API methods (query, execute, update) that execute SQL statements.
Principle:Haifengl_Smile_SQL_Database_Connection - A connection must exist before queries can be executed.
Principle:Haifengl_Smile_SQL_Data_Ingestion - Tables must be loaded before they can be queried.
Principle:Haifengl_Smile_SQL_to_ML_Bridge - Query results are converted to numerical arrays for ML.

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment