Implementation:Haifengl Smile SQL Query API
Overview
The SQL Query API provides three execution methods on smile.data.SQL for running SQL statements against the in-process DuckDB database: query() for SELECT statements that return DataFrames, update() for DML statements that return affected row counts, and execute() for general SQL statements. These methods form the core analytical interface of Smile's SQL pipeline.
API Signature
public class SQL implements AutoCloseable {
/**
* Executes a SELECT statement.
* @param sql a SELECT statement.
* @return the query result as a DataFrame.
* @throws SQLException if fail to execute the SQL query.
*/
public DataFrame query(String sql) throws SQLException
/**
* Executes an INSERT, UPDATE, or DELETE statement.
* @param sql an INSERT, UPDATE, or DELETE statement.
* @return the number of rows affected by the SQL statement.
* @throws SQLException if fail to execute the SQL update.
*/
public int update(String sql) throws SQLException
/**
* Executes an SQL statement, which may return multiple results.
* @param sql an SQL statement.
* @return true if the first result is a ResultSet object;
* false if it is an update count or there are no results.
* @throws SQLException if fail to execute the SQL query.
*/
public boolean execute(String sql) throws SQLException
}
Source
base/src/main/java/smile/data/SQL.java:L385-417
Import
import smile.data.SQL;
import smile.data.DataFrame;
Type
API Doc
Inputs
| Method | Parameter | Type | Description |
|---|---|---|---|
query() |
sql |
String |
A SQL SELECT statement that returns a result set. |
update() |
sql |
String |
A SQL INSERT, UPDATE, or DELETE statement. |
execute() |
sql |
String |
Any SQL statement (DDL, DML, or DQL). |
Outputs
| Method | Return Type | Description |
|---|---|---|
query() |
DataFrame |
A Smile DataFrame containing all rows and columns from the query result. Created via DataFrame.of(ResultSet).
|
update() |
int |
The number of rows affected by the DML statement. |
execute() |
boolean |
true if the first result is a ResultSet; false if it is an update count or there are no results.
|
Internal Mechanism
All three methods follow the same pattern:
- Log the SQL statement at INFO level via SLF4J.
- Create a JDBC
Statementusing the database connection. - Execute the SQL using the appropriate JDBC method.
- Return the result (converting
ResultSettoDataFrameforquery()). - Automatically close the
Statementvia try-with-resources.
// query() implementation
public DataFrame query(String sql) throws SQLException {
logger.info(sql);
try (var stmt = db.createStatement()) {
return DataFrame.of(stmt.executeQuery(sql));
}
}
// update() implementation
public int update(String sql) throws SQLException {
logger.info(sql);
try (var stmt = db.createStatement()) {
return stmt.executeUpdate(sql);
}
}
// execute() implementation
public boolean execute(String sql) throws SQLException {
logger.info(sql);
try (var stmt = db.createStatement()) {
return stmt.execute(sql);
}
}
The DataFrame.of(ResultSet) factory method reads all rows from the JDBC ResultSet into a Smile DataFrame, mapping JDBC types to appropriate Smile column vectors (IntVector, DoubleVector, StringVector, etc.).
Usage Examples
Basic Aggregation Query
import smile.data.SQL;
import smile.data.DataFrame;
try (var db = new SQL()) {
db.csv("sales", "data/sales.csv");
// Aggregate sales by region
DataFrame result = db.query("""
SELECT region,
COUNT(*) AS num_transactions,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_transaction
FROM sales
GROUP BY region
ORDER BY total_revenue DESC
""");
System.out.println(result);
}
Join Query Across Multiple Tables
import smile.data.SQL;
import smile.data.DataFrame;
try (var db = new SQL()) {
db.csv("customers", "data/customers.csv")
.csv("orders", "data/orders.csv")
.csv("products", "data/products.csv");
// Join three tables to compute customer lifetime value
DataFrame clv = db.query("""
SELECT c.customer_id,
c.name,
COUNT(DISTINCT o.order_id) AS num_orders,
SUM(o.quantity * p.price) AS lifetime_value
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN products p ON o.product_id = p.product_id
GROUP BY c.customer_id, c.name
ORDER BY lifetime_value DESC
LIMIT 100
""");
System.out.println(clv);
}
Window Functions for Time-Series Features
import smile.data.SQL;
import smile.data.DataFrame;
try (var db = new SQL()) {
db.parquet("stock_prices", "data/prices.parquet");
// Compute rolling averages and lag features
DataFrame features = db.query("""
SELECT ticker, trade_date, close_price,
LAG(close_price, 1) OVER w AS prev_close,
AVG(close_price) OVER (
PARTITION BY ticker ORDER BY trade_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d,
close_price - LAG(close_price, 1) OVER w AS daily_change
FROM stock_prices
WINDOW w AS (PARTITION BY ticker ORDER BY trade_date)
ORDER BY ticker, trade_date
""");
System.out.println(features);
}
Using update() to Modify Data
import smile.data.SQL;
try (var db = new SQL()) {
db.csv("data", "raw_data.csv");
// Remove outliers
int removed = db.update("DELETE FROM data WHERE value > 1000 OR value < -1000");
System.out.println("Removed " + removed + " outlier rows");
// Add a computed column via a new table
db.execute("""
CREATE TABLE normalized AS
SELECT *,
(value - (SELECT AVG(value) FROM data)) /
(SELECT STDDEV(value) FROM data) AS z_score
FROM data
""");
var result = db.query("SELECT * FROM normalized LIMIT 10");
System.out.println(result);
}
Feature Engineering for ML
import smile.data.SQL;
import smile.data.DataFrame;
try (var db = new SQL()) {
db.csv("transactions", "data/transactions.csv")
.csv("users", "data/users.csv");
// Engineer features for a fraud detection model
DataFrame features = db.query("""
WITH user_stats AS (
SELECT user_id,
COUNT(*) AS total_tx,
AVG(amount) AS avg_amount,
STDDEV(amount) AS std_amount,
MAX(amount) AS max_amount
FROM transactions
GROUP BY user_id
)
SELECT t.transaction_id,
t.amount,
t.amount / us.avg_amount AS amount_ratio,
(t.amount - us.avg_amount) / NULLIF(us.std_amount, 0) AS z_score,
us.total_tx AS user_total_tx,
u.account_age_days,
t.is_fraud AS label
FROM transactions t
JOIN user_stats us ON t.user_id = us.user_id
JOIN users u ON t.user_id = u.user_id
""");
// Convert to ML-ready arrays
double[][] X = features.select("amount", "amount_ratio", "z_score",
"user_total_tx", "account_age_days").toArray();
System.out.println("Feature matrix: " + X.length + " x " + X[0].length);
}
Using execute() for DDL Statements
import smile.data.SQL;
try (var db = new SQL()) {
// Create a table with explicit schema
db.execute("""
CREATE TABLE experiments (
experiment_id INTEGER PRIMARY KEY,
model_name VARCHAR,
accuracy DOUBLE,
f1_score DOUBLE,
training_time_ms BIGINT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""");
// Insert experiment results
int rows = db.update("""
INSERT INTO experiments (experiment_id, model_name, accuracy, f1_score, training_time_ms)
VALUES (1, 'RandomForest', 0.945, 0.932, 1200),
(2, 'GradientBoosting', 0.961, 0.955, 3400),
(3, 'SVM', 0.938, 0.921, 8500)
""");
System.out.println("Inserted " + rows + " experiment records");
var leaderboard = db.query("SELECT * FROM experiments ORDER BY accuracy DESC");
System.out.println(leaderboard);
}
Method Selection Guide
| Use Case | Method | Example |
|---|---|---|
| Read data, compute aggregations | query() |
SELECT ... FROM ... GROUP BY ...
|
| Insert new rows | update() |
INSERT INTO ... VALUES ...
|
| Delete rows (e.g., outlier removal) | update() |
DELETE FROM ... WHERE ...
|
| Create/drop tables | execute() |
CREATE TABLE ..., DROP TABLE ...
|
| Install/load extensions | execute() |
INSTALL httpfs;
|
| Any statement where you are unsure | execute() |
Works for any SQL statement type. |
Related
- Principle:Haifengl_Smile_SQL_Query_Execution - The theoretical basis for SQL query execution and relational algebra.
- Implementation:Haifengl_Smile_SQL_Constructor - Queries require an active SQL connection.
- Implementation:Haifengl_Smile_SQL_Data_Loaders - Tables must be loaded before they can be queried.
- Implementation:Haifengl_Smile_DataFrame_To_Numerical_For_SQL - Query results (DataFrames) are converted to numerical arrays for ML.