Principle:Haifengl Smile SQL Query Execution
Overview
SQL Query Execution is the principle of running SQL statements against loaded data to perform aggregations, joins, filtering, window functions, and other analytical computations. SQL provides a declarative interface for complex data transformations: the user specifies what result is desired, and the query engine determines how to compute it efficiently.
In the Smile library, query execution is the core analytical operation that transforms raw tabular data into meaningful insights, summary statistics, and feature sets suitable for machine learning.
Theoretical Basis
Relational Algebra Foundations
SQL is a declarative language built on relational algebra, a formal system for manipulating sets of tuples (rows). The fundamental operations are:
| Operation | SQL Equivalent | Symbol | Description |
|---|---|---|---|
| Selection | WHERE |
sigma | Filter rows based on a predicate. |
| Projection | SELECT col1, col2 |
pi | Select specific columns. |
| Cartesian Product | CROSS JOIN |
x | Combine every row of one table with every row of another. |
| Union | UNION |
U | Combine rows from two tables with the same schema. |
| Difference | EXCEPT |
- | Rows in one table but not in another. |
| Rename | AS |
rho | Rename columns or tables. |
Joins are derived from selection applied to Cartesian products. An equi-join (A JOIN B ON A.id = B.id) is equivalent to sigma(A x B) where the predicate tests equality of the join keys.
Query Optimization
The query optimizer transforms a declarative SQL statement into an efficient execution plan. Key optimization strategies include:
- Predicate pushdown: Move
WHEREfilters as close to the data source as possible, reducing the volume of data processed. - Projection pushdown: Only read columns that are referenced in the query, avoiding unnecessary I/O.
- Join reordering: Determine the optimal order to join multiple tables, minimizing intermediate result sizes.
- Hash aggregation: Use hash tables for
GROUP BYoperations instead of sorting, which is O(n) vs O(n log n). - Vectorized execution: Process data in batches (vectors) rather than row-by-row, exploiting CPU cache locality and SIMD instructions.
DuckDB's optimizer is particularly effective for OLAP queries because of its columnar vectorized execution engine, which processes data in batches of 2048 values at a time, achieving near-native performance for analytical workloads.
Three Categories of SQL Statements
SQL statements are categorized by their purpose:
| Category | Statements | Returns | Use Case |
|---|---|---|---|
| DQL (Data Query Language) | SELECT |
Result set (rows and columns) | Reading and analyzing data. |
| DML (Data Manipulation Language) | INSERT, UPDATE, DELETE |
Row count (number affected) | Modifying data in tables. |
| DDL (Data Definition Language) | CREATE, DROP, ALTER |
Boolean success indicator | Defining or modifying table structure. |
Each category requires a different execution method. Smile's SQL class provides three corresponding methods: query() for DQL, update() for DML, and execute() for general statements.
Analytical SQL Patterns
Modern analytical databases support advanced SQL features beyond basic CRUD:
Window Functions enable calculations across related rows without collapsing them into groups:
SELECT name, department, salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees
Common Table Expressions (CTEs) enable modular query composition:
WITH monthly_sales AS (
SELECT date_trunc('month', sale_date) AS month, SUM(amount) AS total
FROM sales GROUP BY month
)
SELECT month, total, LAG(total) OVER (ORDER BY month) AS prev_month
FROM monthly_sales
GROUPING SETS, ROLLUP, CUBE enable multi-level aggregation in a single pass:
SELECT region, product, SUM(revenue) FROM sales GROUP BY GROUPING SETS ((region), (product), (region, product), ())
Relevance to Machine Learning
SQL query execution serves as the feature engineering layer in ML pipelines:
- Aggregation queries compute summary statistics (mean, variance, count) that become features for models.
- Join queries combine data from multiple sources to create enriched feature vectors.
- Window functions compute lag features, rolling averages, and rankings that capture temporal or ordinal patterns.
- Filtering queries select training subsets, remove outliers, or split data into train/test sets.
The result of a query is a DataFrame, which serves as the bridge between SQL-based data preparation and numerical ML algorithms.
Knowledge Sources
Domains
Data_Engineering, SQL, Analytics, Feature_Engineering
Related
- Implementation:Haifengl_Smile_SQL_Query_API - The concrete API methods (query, execute, update) that execute SQL statements.
- Principle:Haifengl_Smile_SQL_Database_Connection - A connection must exist before queries can be executed.
- Principle:Haifengl_Smile_SQL_Data_Ingestion - Tables must be loaded before they can be queried.
- Principle:Haifengl_Smile_SQL_to_ML_Bridge - Query results are converted to numerical arrays for ML.