Workflow:Haifengl Smile SQL Analytics Pipeline

Knowledge Sources	Smile DuckDB
Domains	Data_Engineering, Analytics
Last Updated	2026-02-08 21:00 GMT

Overview

End-to-end process for performing SQL-based analytics on tabular data using Smile's DuckDB-powered SQL interface, from loading data sources into in-memory tables to querying and retrieving results as DataFrames.

Description

This workflow outlines the procedure for using Smile's built-in SQL class, which wraps an embedded DuckDB database engine. It allows users to load data from CSV, Parquet, and Iceberg sources into in-memory tables, then run standard SQL queries against them. Query results are returned as Smile DataFrames, enabling seamless integration with the rest of the Smile machine learning pipeline. The SQL interface also supports extensions, metadata inspection, and persistent databases.

Usage

Execute this workflow when you need to perform SQL-based data exploration, aggregation, or transformation on one or more data files before feeding results into machine learning algorithms. This is particularly useful when working with multiple data sources that need to be joined, filtered, or aggregated using familiar SQL syntax.

Execution Steps

Step 1: Open a SQL Database

Create an in-memory or persistent DuckDB database instance through the SQL class. An in-memory database is ephemeral and suitable for ad-hoc analysis. A persistent database stores data to disk for later reuse.

Key considerations:

The default constructor creates an in-memory database
Pass a file path to the constructor for a persistent database
SQL implements AutoCloseable, so use try-with-resources for proper cleanup
DuckDB extensions can be installed and loaded for additional functionality (e.g., Iceberg, spatial)

Step 2: Load Data Sources into Tables

Import data files into in-memory tables using format-specific loader methods. The SQL class provides csv(), parquet(), and iceberg() methods that create tables from external files. Multiple files with the same schema can be loaded into a single table using glob patterns.

Key considerations:

csv() supports custom delimiters, headers, and column type specifications
parquet() supports options like hive_partitioning and union_by_name
iceberg() supports reading Iceberg table format with versioning
Multiple files can be loaded into one table if they share the same schema
Tables can also be created from raw SQL CREATE TABLE statements

Step 3: Explore Table Metadata

Inspect the available tables and their column schemas before querying. Use tables() to list all tables in the database and describe() to view column names, types, and nullability for a specific table.

Key considerations:

tables() returns a DataFrame with table metadata
describe(tableName) returns column-level metadata (name, type, nullable)
extensions() shows installed DuckDB extensions

Step 4: Execute SQL Queries

Run SQL queries against the in-memory tables. DuckDB supports a rich SQL dialect including window functions, CTEs, aggregations, joins, and subqueries. Query results are automatically converted to Smile DataFrames.

Key considerations:

The query() method returns results as a DataFrame
The execute() method runs DDL or DML statements without returning results
DuckDB supports standard SQL including window functions, CTEs, and complex joins
Queries can reference multiple tables for joins and unions

Step 5: Use Results in ML Pipeline

Take the DataFrame results from SQL queries and feed them into the Smile machine learning pipeline. Convert to arrays or matrices, apply additional transformations, or pass directly to algorithms that accept DataFrames.

Key considerations:

Query results are standard Smile DataFrames compatible with all downstream operations
Close the SQL instance when done to release DuckDB resources
For large datasets, SQL-level filtering and aggregation is more efficient than in-memory DataFrame operations

Execution Diagram

GitHub URL

Workflow Repository