
Implementation:Haifengl Smile SQL Data Loaders

From Leeroopedia


Overview

The SQL Data Loaders are a set of methods on the smile.data.SQL class that create in-memory tables from external file sources. Each method generates a CREATE TABLE AS SELECT * FROM read_xxx(...) statement, delegating file parsing entirely to the DuckDB engine. All methods return this to support fluent method chaining.

API Signature

public class SQL implements AutoCloseable {
    /**
     * Creates an in-memory table from CSV files with comma delimiter.
     * @param name the table name.
     * @param path the CSV file path(s) (supports glob patterns).
     * @return this object.
     */
    public SQL csv(String name, String... path) throws SQLException

    /**
     * Creates an in-memory table from CSV files with custom delimiter and columns.
     * @param name the table name.
     * @param delimiter the delimiter character.
     * @param columns optional map of column names to column types.
     * @param path the CSV file path(s).
     * @return this object.
     */
    public SQL csv(String name, char delimiter, Map<String, String> columns, String... path) throws SQLException

    /**
     * Creates an in-memory table from Parquet files.
     * @param name the table name.
     * @param path the Parquet file path(s) (supports glob patterns).
     * @return this object.
     */
    public SQL parquet(String name, String... path) throws SQLException

    /**
     * Creates an in-memory table from Parquet files with options.
     * @param name the table name.
     * @param options map of read options (e.g., hive_partitioning, union_by_name).
     * @param path the Parquet file path(s).
     * @return this object.
     */
    public SQL parquet(String name, Map<String, String> options, String... path) throws SQLException

    /**
     * Creates an in-memory table from an Iceberg table.
     * @param name the table name.
     * @param path the path to the Iceberg table folder.
     * @return this object.
     */
    public SQL iceberg(String name, String path) throws SQLException

    /**
     * Creates an in-memory table from an Iceberg table with moved-paths support.
     * @param name the table name.
     * @param path the path to the Iceberg table folder.
     * @param allowMovedPaths whether to allow scanning moved Iceberg tables.
     * @return this object.
     */
    public SQL iceberg(String name, String path, boolean allowMovedPaths) throws SQLException

    /**
     * Creates an in-memory table from JSON files.
     * @param name the table name.
     * @param path the JSON file path(s).
     * @return this object.
     */
    public SQL json(String name, String... path) throws SQLException

    /**
     * Creates an in-memory table from JSON files with format and column spec.
     * @param name the table name.
     * @param format "auto", "unstructured", "newline_delimited", or "array".
     * @param columns optional map of column names to types.
     * @param path the JSON file path(s).
     * @return this object.
     */
    public SQL json(String name, String format, Map<String, String> columns, String... path) throws SQLException
}

Source

base/src/main/java/smile/data/SQL.java:L159-343

Import

import smile.data.SQL;
import java.util.Map;

Type

API Doc

Inputs

name (String, required): The name of the table to create in the database.
path (String..., required): One or more file paths. Supports glob patterns (e.g., "data/*.csv"). Multiple files are unioned if they share the same schema.
delimiter (char, optional; CSV only): The column delimiter character. The simple csv(name, path...) overload uses ','.
columns (Map<String, String>, optional; CSV and JSON only): Explicit column specification mapping column names to DuckDB type names.
options (Map<String, String>, optional; Parquet only): Reader options such as binary_as_string, hive_partitioning, union_by_name.
allowMovedPaths (boolean, optional; Iceberg only): Whether to allow scanning Iceberg tables that have been relocated.
format (String, optional; JSON only): JSON format: "auto", "newline_delimited", "array", or "unstructured".

Outputs

Return type: SQL. Each method returns this for method chaining (fluent builder pattern).

Each method creates a new table in the in-memory database as a side effect. The table persists for the lifetime of the SQL connection.

Internal Mechanism

Each loader method constructs a SQL string using DuckDB's native file-reading functions:

csv(): CREATE TABLE {name} AS SELECT * FROM read_csv([paths], delim=',', header=true)
parquet(): CREATE TABLE {name} AS SELECT * FROM read_parquet([paths], options...)
iceberg(): CREATE TABLE {name} AS SELECT * FROM iceberg_scan('path', allow_moved_paths = false)
json(): CREATE TABLE {name} AS SELECT * FROM read_json([paths], format = 'auto')

File paths are formatted into a DuckDB list literal (e.g., ['file1.csv', 'file2.csv']) via the private fileList() helper method. Column specifications are formatted into a DuckDB struct literal via the private columnList() helper.

All generated SQL statements are logged at INFO level via SLF4J before execution.
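The literal formats can be illustrated with plain string manipulation. The helpers below are a hypothetical re-creation of the private fileList() and columnList() methods, inferred from the description above; they are not the actual Smile source, only a sketch of the list and struct literal shapes DuckDB expects:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class DuckDbLiterals {
    // Hypothetical stand-in for the private fileList() helper:
    // formats paths as a DuckDB list literal, e.g. ['a.csv', 'b.csv'].
    static String fileList(String... paths) {
        return Arrays.stream(paths)
                .map(p -> "'" + p + "'")
                .collect(Collectors.joining(", ", "[", "]"));
    }

    // Hypothetical stand-in for the private columnList() helper:
    // formats a column spec as a DuckDB struct literal,
    // e.g. {'id': 'INTEGER', 'name': 'VARCHAR'}.
    static String columnList(Map<String, String> columns) {
        return columns.entrySet().stream()
                .map(e -> "'" + e.getKey() + "': '" + e.getValue() + "'")
                .collect(Collectors.joining(", ", "{", "}"));
    }

    public static void main(String[] args) {
        // Assemble a statement in the csv() pattern shown above.
        String sql = "CREATE TABLE logs AS SELECT * FROM read_csv("
                + fileList("data/a.csv", "data/b.csv")
                + ", delim=',', header=true)";
        System.out.println(sql);
    }
}
```

Running this prints CREATE TABLE logs AS SELECT * FROM read_csv(['data/a.csv', 'data/b.csv'], delim=',', header=true), matching the csv() pattern in the table above.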

Usage Examples

Loading a Single CSV File

import smile.data.SQL;
import smile.data.DataFrame;

try (var db = new SQL()) {
    // Load a CSV file with auto-detected schema
    db.csv("iris", "data/iris.csv");

    // Query the loaded table
    DataFrame df = db.query("SELECT species, AVG(sepal_length) AS avg_sl FROM iris GROUP BY species");
    System.out.println(df);
}

Loading Multiple CSV Files with Glob Pattern

import smile.data.SQL;
import smile.data.DataFrame;

try (var db = new SQL()) {
    // Load all CSV files matching the glob pattern into one table
    db.csv("logs", "data/logs_2024_*.csv");

    DataFrame count = db.query("SELECT COUNT(*) AS total_rows FROM logs");
    System.out.println(count);
}

Loading CSV with Custom Delimiter and Explicit Schema

import smile.data.SQL;
import java.util.Map;

try (var db = new SQL()) {
    // Tab-delimited file with explicit column types
    Map<String, String> columns = Map.of(
        "id", "INTEGER",
        "name", "VARCHAR",
        "score", "DOUBLE",
        "active", "BOOLEAN"
    );
    db.csv("users", '\t', columns, "data/users.tsv");

    System.out.println(db.describe("users"));
}

Loading Parquet Files with Hive Partitioning

import smile.data.SQL;
import java.util.Map;

try (var db = new SQL()) {
    // Read Hive-partitioned Parquet files
    Map<String, String> opts = Map.of("hive_partitioning", "true");
    db.parquet("events", opts, "data/events/year=*/month=*/*.parquet");

    var result = db.query("""
        SELECT year, month, COUNT(*) AS event_count
        FROM events
        GROUP BY year, month
        ORDER BY year, month
    """);
    System.out.println(result);
}

Loading Iceberg Table

import smile.data.SQL;

try (var db = new SQL()) {
    // The Iceberg extension must be installed and loaded first
    db.installExtension("iceberg");
    db.loadExtension("iceberg");

    // Load an Iceberg table
    db.iceberg("warehouse_data", "/data/iceberg/warehouse");

    var summary = db.query("SELECT product_category, SUM(revenue) AS total FROM warehouse_data GROUP BY product_category");
    System.out.println(summary);
}

Fluent Chaining: Loading Multiple Data Sources

import smile.data.SQL;

try (var db = new SQL()) {
    // Load multiple data sources in a fluent chain
    db.csv("customers", "data/customers.csv")
      .parquet("orders", "data/orders/*.parquet")
      .csv("products", "data/products.csv");

    // Join across the loaded tables
    var result = db.query("""
        SELECT c.name, p.product_name, o.quantity, o.total_price
        FROM orders o
        JOIN customers c ON o.customer_id = c.id
        JOIN products p ON o.product_id = p.id
        WHERE o.total_price > 100
        ORDER BY o.total_price DESC
        LIMIT 20
    """);
    System.out.println(result);
}
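Loading JSON Files

A sketch for the json() loader in the same style as the examples above, using the four-argument overload from the API signature. The file path data/events.ndjson and its field names are assumptions for illustration:

```java
import smile.data.SQL;
import java.util.Map;

try (var db = new SQL()) {
    // Newline-delimited JSON with an explicit column spec.
    // The path and fields here are hypothetical.
    Map<String, String> columns = Map.of(
        "user_id", "INTEGER",
        "event", "VARCHAR",
        "ts", "TIMESTAMP"
    );
    db.json("clicks", "newline_delimited", columns, "data/events.ndjson");

    var counts = db.query("""
        SELECT event, COUNT(*) AS n
        FROM clicks
        GROUP BY event
        ORDER BY n DESC
    """);
    System.out.println(counts);
}
```

With format "auto" and no column map, the simpler json(name, path...) overload lets DuckDB detect both the layout and the schema.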

Parquet Reader Options

The parquet() method with options supports the following DuckDB Parquet reader parameters:

binary_as_string (boolean): Load binary columns as strings (for legacy Parquet writers that do not set the UTF8 flag).
filename (boolean): Include an extra filename column with the source file path.
file_row_number (boolean): Include a file_row_number column with the row number within each file.
hive_partitioning (boolean): Interpret the directory structure as Hive partitioning (e.g., year=2024/month=01).
union_by_name (boolean): Unify columns from multiple files by name rather than by position.
