Implementation:Haifengl Smile SQL Query API

Overview

The SQL Query API provides three execution methods on smile.data.SQL for running SQL statements against the in-process DuckDB database: query() for SELECT statements that return DataFrames, update() for DML statements that return affected row counts, and execute() for general SQL statements. These methods form the core analytical interface of Smile's SQL pipeline.

API Signature

public class SQL implements AutoCloseable {
    /**
     * Executes a SELECT statement.
     * @param sql a SELECT statement.
     * @return the query result as a DataFrame.
     * @throws SQLException if fail to execute the SQL query.
     */
    public DataFrame query(String sql) throws SQLException

    /**
     * Executes an INSERT, UPDATE, or DELETE statement.
     * @param sql an INSERT, UPDATE, or DELETE statement.
     * @return the number of rows affected by the SQL statement.
     * @throws SQLException if fail to execute the SQL update.
     */
    public int update(String sql) throws SQLException

    /**
     * Executes an SQL statement, which may return multiple results.
     * @param sql an SQL statement.
     * @return true if the first result is a ResultSet object;
     *         false if it is an update count or there are no results.
     * @throws SQLException if fail to execute the SQL query.
     */
    public boolean execute(String sql) throws SQLException
}

Source

base/src/main/java/smile/data/SQL.java:L385-417

Import

import smile.data.SQL;
import smile.data.DataFrame;

Type

API Doc

Inputs

Method	Parameter	Type	Description
`query()`	`sql`	`String`	A SQL SELECT statement that returns a result set.
`update()`	`sql`	`String`	A SQL INSERT, UPDATE, or DELETE statement.
`execute()`	`sql`	`String`	Any SQL statement (DDL, DML, or DQL).

Outputs

Method	Return Type	Description
`query()`	`DataFrame`	A Smile DataFrame containing all rows and columns from the query result. Created via `DataFrame.of(ResultSet)`.
`update()`	`int`	The number of rows affected by the DML statement.
`execute()`	`boolean`	`true` if the first result is a ResultSet; `false` if it is an update count or there are no results.

Internal Mechanism

All three methods follow the same pattern:

Log the SQL statement at INFO level via SLF4J.
Create a JDBC Statement using the database connection.
Execute the SQL using the appropriate JDBC method.
Return the result (converting ResultSet to DataFrame for query()).
Automatically close the Statement via try-with-resources.

// query() implementation
public DataFrame query(String sql) throws SQLException {
    logger.info(sql);
    try (var stmt = db.createStatement()) {
        return DataFrame.of(stmt.executeQuery(sql));
    }
}

// update() implementation
public int update(String sql) throws SQLException {
    logger.info(sql);
    try (var stmt = db.createStatement()) {
        return stmt.executeUpdate(sql);
    }
}

// execute() implementation
public boolean execute(String sql) throws SQLException {
    logger.info(sql);
    try (var stmt = db.createStatement()) {
        return stmt.execute(sql);
    }
}

The DataFrame.of(ResultSet) factory method reads all rows from the JDBC ResultSet into a Smile DataFrame, mapping JDBC types to appropriate Smile column vectors (IntVector, DoubleVector, StringVector, etc.).

Usage Examples

Basic Aggregation Query

import smile.data.SQL;
import smile.data.DataFrame;

try (var db = new SQL()) {
    db.csv("sales", "data/sales.csv");

    // Aggregate sales by region
    DataFrame result = db.query("""
        SELECT region,
               COUNT(*) AS num_transactions,
               SUM(amount) AS total_revenue,
               AVG(amount) AS avg_transaction
        FROM sales
        GROUP BY region
        ORDER BY total_revenue DESC
    """);
    System.out.println(result);
}

Join Query Across Multiple Tables

import smile.data.SQL;
import smile.data.DataFrame;

try (var db = new SQL()) {
    db.csv("customers", "data/customers.csv")
      .csv("orders", "data/orders.csv")
      .csv("products", "data/products.csv");

    // Join three tables to compute customer lifetime value
    DataFrame clv = db.query("""
        SELECT c.customer_id,
               c.name,
               COUNT(DISTINCT o.order_id) AS num_orders,
               SUM(o.quantity * p.price) AS lifetime_value
        FROM customers c
        JOIN orders o ON c.customer_id = o.customer_id
        JOIN products p ON o.product_id = p.product_id
        GROUP BY c.customer_id, c.name
        ORDER BY lifetime_value DESC
        LIMIT 100
    """);
    System.out.println(clv);
}

Window Functions for Time-Series Features

import smile.data.SQL;
import smile.data.DataFrame;

try (var db = new SQL()) {
    db.parquet("stock_prices", "data/prices.parquet");

    // Compute rolling averages and lag features
    DataFrame features = db.query("""
        SELECT ticker, trade_date, close_price,
               LAG(close_price, 1) OVER w AS prev_close,
               AVG(close_price) OVER (
                   PARTITION BY ticker ORDER BY trade_date
                   ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
               ) AS moving_avg_7d,
               close_price - LAG(close_price, 1) OVER w AS daily_change
        FROM stock_prices
        WINDOW w AS (PARTITION BY ticker ORDER BY trade_date)
        ORDER BY ticker, trade_date
    """);
    System.out.println(features);
}

Using update() to Modify Data

import smile.data.SQL;

try (var db = new SQL()) {
    db.csv("data", "raw_data.csv");

    // Remove outliers
    int removed = db.update("DELETE FROM data WHERE value > 1000 OR value < -1000");
    System.out.println("Removed " + removed + " outlier rows");

    // Add a computed column via a new table
    db.execute("""
        CREATE TABLE normalized AS
        SELECT *,
               (value - (SELECT AVG(value) FROM data)) /
               (SELECT STDDEV(value) FROM data) AS z_score
        FROM data
    """);

    var result = db.query("SELECT * FROM normalized LIMIT 10");
    System.out.println(result);
}

Feature Engineering for ML

import smile.data.SQL;
import smile.data.DataFrame;

try (var db = new SQL()) {
    db.csv("transactions", "data/transactions.csv")
      .csv("users", "data/users.csv");

    // Engineer features for a fraud detection model
    DataFrame features = db.query("""
        WITH user_stats AS (
            SELECT user_id,
                   COUNT(*) AS total_tx,
                   AVG(amount) AS avg_amount,
                   STDDEV(amount) AS std_amount,
                   MAX(amount) AS max_amount
            FROM transactions
            GROUP BY user_id
        )
        SELECT t.transaction_id,
               t.amount,
               t.amount / us.avg_amount AS amount_ratio,
               (t.amount - us.avg_amount) / NULLIF(us.std_amount, 0) AS z_score,
               us.total_tx AS user_total_tx,
               u.account_age_days,
               t.is_fraud AS label
        FROM transactions t
        JOIN user_stats us ON t.user_id = us.user_id
        JOIN users u ON t.user_id = u.user_id
    """);

    // Convert to ML-ready arrays
    double[][] X = features.select("amount", "amount_ratio", "z_score",
                                    "user_total_tx", "account_age_days").toArray();
    System.out.println("Feature matrix: " + X.length + " x " + X[0].length);
}

Using execute() for DDL Statements

import smile.data.SQL;

try (var db = new SQL()) {
    // Create a table with explicit schema
    db.execute("""
        CREATE TABLE experiments (
            experiment_id INTEGER PRIMARY KEY,
            model_name VARCHAR,
            accuracy DOUBLE,
            f1_score DOUBLE,
            training_time_ms BIGINT,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """);

    // Insert experiment results
    int rows = db.update("""
        INSERT INTO experiments (experiment_id, model_name, accuracy, f1_score, training_time_ms)
        VALUES (1, 'RandomForest', 0.945, 0.932, 1200),
               (2, 'GradientBoosting', 0.961, 0.955, 3400),
               (3, 'SVM', 0.938, 0.921, 8500)
    """);
    System.out.println("Inserted " + rows + " experiment records");

    var leaderboard = db.query("SELECT * FROM experiments ORDER BY accuracy DESC");
    System.out.println(leaderboard);
}

Method Selection Guide

Use Case	Method	Example
Read data, compute aggregations	`query()`	`SELECT ... FROM ... GROUP BY ...`
Insert new rows	`update()`	`INSERT INTO ... VALUES ...`
Delete rows (e.g., outlier removal)	`update()`	`DELETE FROM ... WHERE ...`
Create/drop tables	`execute()`	`CREATE TABLE ...`, `DROP TABLE ...`
Install/load extensions	`execute()`	`INSTALL httpfs;`
Any statement where you are unsure	`execute()`	Works for any SQL statement type.

Principle:Haifengl_Smile_SQL_Query_Execution - The theoretical basis for SQL query execution and relational algebra.
Implementation:Haifengl_Smile_SQL_Constructor - Queries require an active SQL connection.
Implementation:Haifengl_Smile_SQL_Data_Loaders - Tables must be loaded before they can be queried.
Implementation:Haifengl_Smile_DataFrame_To_Numerical_For_SQL - Query results (DataFrames) are converted to numerical arrays for ML.

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment