Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Haifengl Smile SQL Analytics Pipeline

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Analytics
Last Updated 2026-02-08 21:00 GMT

Overview

End-to-end process for performing SQL-based analytics on tabular data using Smile's DuckDB-powered SQL interface, from loading data sources into in-memory tables to querying and retrieving results as DataFrames.

Description

This workflow outlines the procedure for using Smile's built-in SQL class, which wraps an embedded DuckDB database engine. It allows users to load data from CSV, Parquet, and Iceberg sources into in-memory tables, then run standard SQL queries against them. Query results are returned as Smile DataFrames, enabling seamless integration with the rest of the Smile machine learning pipeline. The SQL interface also supports extensions, metadata inspection, and persistent databases.

Usage

Execute this workflow when you need to perform SQL-based data exploration, aggregation, or transformation on one or more data files before feeding results into machine learning algorithms. This is particularly useful when working with multiple data sources that need to be joined, filtered, or aggregated using familiar SQL syntax.

Execution Steps

Step 1: Open a SQL Database

Create an in-memory or persistent DuckDB database instance through the SQL class. An in-memory database is ephemeral and suitable for ad-hoc analysis. A persistent database stores data to disk for later reuse.

Key considerations:

  • The default constructor creates an in-memory database
  • Pass a file path to the constructor for a persistent database
  • SQL implements AutoCloseable, so use try-with-resources for proper cleanup
  • DuckDB extensions can be installed and loaded for additional functionality (e.g., Iceberg, spatial)

Step 2: Load Data Sources into Tables

Import data files into in-memory tables using format-specific loader methods. The SQL class provides csv(), parquet(), and iceberg() methods that create tables from external files. Multiple files with the same schema can be loaded into a single table using glob patterns.

Key considerations:

  • csv() supports custom delimiters, headers, and column type specifications
  • parquet() supports options like hive_partitioning and union_by_name
  • iceberg() supports reading Iceberg table format with versioning
  • Multiple files can be loaded into one table if they share the same schema
  • Tables can also be created from raw SQL CREATE TABLE statements

Step 3: Explore Table Metadata

Inspect the available tables and their column schemas before querying. Use tables() to list all tables in the database and describe() to view column names, types, and nullability for a specific table.

Key considerations:

  • tables() returns a DataFrame with table metadata
  • describe(tableName) returns column-level metadata (name, type, nullable)
  • extensions() shows installed DuckDB extensions

Step 4: Execute SQL Queries

Run SQL queries against the in-memory tables. DuckDB supports a rich SQL dialect including window functions, CTEs, aggregations, joins, and subqueries. Query results are automatically converted to Smile DataFrames.

Key considerations:

  • The query() method returns results as a DataFrame
  • The execute() method runs DDL or DML statements without returning results
  • DuckDB supports standard SQL including window functions, CTEs, and complex joins
  • Queries can reference multiple tables for joins and unions

Step 5: Use Results in ML Pipeline

Take the DataFrame results from SQL queries and feed them into the Smile machine learning pipeline. Convert to arrays or matrices, apply additional transformations, or pass directly to algorithms that accept DataFrames.

Key considerations:

  • Query results are standard Smile DataFrames compatible with all downstream operations
  • Close the SQL instance when done to release DuckDB resources
  • For large datasets, SQL-level filtering and aggregation is more efficient than in-memory DataFrame operations

Execution Diagram

GitHub URL

Workflow Repository