
Implementation:Haifengl Smile SQL Data Loaders

From Leeroopedia


Overview

The SQL Data Loaders are a set of methods on the smile.data.SQL class that create in-memory tables from external file sources. Each method generates a CREATE TABLE AS SELECT * FROM read_xxx(...) statement, delegating file parsing entirely to the DuckDB engine. All methods return this to support fluent method chaining.

API Signature

public class SQL implements AutoCloseable {
    /**
     * Creates an in-memory table from CSV files with comma delimiter.
     * @param name the table name.
     * @param path the CSV file path(s) (supports glob patterns).
     * @return this object.
     */
    public SQL csv(String name, String... path) throws SQLException

    /**
     * Creates an in-memory table from CSV files with custom delimiter and columns.
     * @param name the table name.
     * @param delimiter the delimiter character.
     * @param columns optional map of column names to column types.
     * @param path the CSV file path(s).
     * @return this object.
     */
    public SQL csv(String name, char delimiter, Map<String, String> columns, String... path) throws SQLException

    /**
     * Creates an in-memory table from Parquet files.
     * @param name the table name.
     * @param path the Parquet file path(s) (supports glob patterns).
     * @return this object.
     */
    public SQL parquet(String name, String... path) throws SQLException

    /**
     * Creates an in-memory table from Parquet files with options.
     * @param name the table name.
     * @param options map of read options (e.g., hive_partitioning, union_by_name).
     * @param path the Parquet file path(s).
     * @return this object.
     */
    public SQL parquet(String name, Map<String, String> options, String... path) throws SQLException

    /**
     * Creates an in-memory table from an Iceberg table.
     * @param name the table name.
     * @param path the path to the Iceberg table folder.
     * @return this object.
     */
    public SQL iceberg(String name, String path) throws SQLException

    /**
     * Creates an in-memory table from an Iceberg table with moved-paths support.
     * @param name the table name.
     * @param path the path to the Iceberg table folder.
     * @param allowMovedPaths whether to allow scanning moved Iceberg tables.
     * @return this object.
     */
    public SQL iceberg(String name, String path, boolean allowMovedPaths) throws SQLException

    /**
     * Creates an in-memory table from JSON files.
     * @param name the table name.
     * @param path the JSON file path(s).
     * @return this object.
     */
    public SQL json(String name, String... path) throws SQLException

    /**
     * Creates an in-memory table from JSON files with format and column spec.
     * @param name the table name.
     * @param format "auto", "unstructured", "newline_delimited", or "array".
     * @param columns optional map of column names to types.
     * @param path the JSON file path(s).
     * @return this object.
     */
    public SQL json(String name, String format, Map<String, String> columns, String... path) throws SQLException
}

Source

base/src/main/java/smile/data/SQL.java:L159-343

Import

import smile.data.SQL;
import java.util.Map;

Type

API Doc

Inputs

name (String, required): The name of the table to create in the database.
path (String..., required): One or more file paths. Supports glob patterns (e.g., "data/*.csv"). Multiple files are unioned if they share the same schema.
delimiter (char, optional; CSV only): The column delimiter character. The simple csv(name, path...) overload uses ','.
columns (Map<String, String>, optional; CSV and JSON only): Explicit column specification mapping column names to DuckDB type names.
options (Map<String, String>, optional; Parquet only): Reader options such as binary_as_string, hive_partitioning, union_by_name.
allowMovedPaths (boolean, optional; Iceberg only): Whether to allow scanning Iceberg tables that have been relocated.
format (String, optional; JSON only): JSON format: "auto", "newline_delimited", "array", or "unstructured".

Outputs

Return type: SQL. Each method returns this for method chaining (fluent builder pattern).

Each method creates a new table in the in-memory database as a side effect. The table persists for the lifetime of the SQL connection.

Internal Mechanism

Each loader method constructs a SQL string using DuckDB's native file-reading functions:

csv(): CREATE TABLE {name} AS SELECT * FROM read_csv([paths], delim=',', header=true)
parquet(): CREATE TABLE {name} AS SELECT * FROM read_parquet([paths], options...)
iceberg(): CREATE TABLE {name} AS SELECT * FROM iceberg_scan('path', allow_moved_paths = false)
json(): CREATE TABLE {name} AS SELECT * FROM read_json([paths], format = 'auto')

File paths are formatted into a DuckDB list literal (e.g., ['file1.csv', 'file2.csv']) via the private fileList() helper method. Column specifications are formatted into a DuckDB struct literal via the private columnList() helper.

All generated SQL statements are logged at INFO level via SLF4J before execution.
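The literal formats can be illustrated with plain string manipulation. The helpers below are a hypothetical re-creation of the private fileList() and columnList() methods, inferred from the description above; they are not the actual Smile source, only a sketch of the list and struct literal shapes DuckDB expects:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class DuckDbLiterals {
    // Hypothetical stand-in for the private fileList() helper:
    // formats paths as a DuckDB list literal, e.g. ['a.csv', 'b.csv'].
    static String fileList(String... paths) {
        return Arrays.stream(paths)
                .map(p -> "'" + p + "'")
                .collect(Collectors.joining(", ", "[", "]"));
    }

    // Hypothetical stand-in for the private columnList() helper:
    // formats a column spec as a DuckDB struct literal,
    // e.g. {'id': 'INTEGER', 'name': 'VARCHAR'}.
    static String columnList(Map<String, String> columns) {
        return columns.entrySet().stream()
                .map(e -> "'" + e.getKey() + "': '" + e.getValue() + "'")
                .collect(Collectors.joining(", ", "{", "}"));
    }

    public static void main(String[] args) {
        // Assemble a statement in the csv() pattern shown above.
        String sql = "CREATE TABLE logs AS SELECT * FROM read_csv("
                + fileList("data/a.csv", "data/b.csv")
                + ", delim=',', header=true)";
        System.out.println(sql);
    }
}
```

Running this prints CREATE TABLE logs AS SELECT * FROM read_csv(['data/a.csv', 'data/b.csv'], delim=',', header=true), matching the csv() pattern in the table above.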

Usage Examples

Loading a Single CSV File

import smile.data.SQL;
import smile.data.DataFrame;

try (var db = new SQL()) {
    // Load a CSV file with auto-detected schema
    db.csv("iris", "data/iris.csv");

    // Query the loaded table
    DataFrame df = db.query("SELECT species, AVG(sepal_length) AS avg_sl FROM iris GROUP BY species");
    System.out.println(df);
}

Loading Multiple CSV Files with Glob Pattern

import smile.data.SQL;
import smile.data.DataFrame;

try (var db = new SQL()) {
    // Load all CSV files matching the glob pattern into one table
    db.csv("logs", "data/logs_2024_*.csv");

    DataFrame count = db.query("SELECT COUNT(*) AS total_rows FROM logs");
    System.out.println(count);
}

Loading CSV with Custom Delimiter and Explicit Schema

import smile.data.SQL;
import java.util.Map;

try (var db = new SQL()) {
    // Tab-delimited file with explicit column types
    Map<String, String> columns = Map.of(
        "id", "INTEGER",
        "name", "VARCHAR",
        "score", "DOUBLE",
        "active", "BOOLEAN"
    );
    db.csv("users", '\t', columns, "data/users.tsv");

    System.out.println(db.describe("users"));
}

Loading Parquet Files with Hive Partitioning

import smile.data.SQL;
import java.util.Map;

try (var db = new SQL()) {
    // Read Hive-partitioned Parquet files
    Map<String, String> opts = Map.of("hive_partitioning", "true");
    db.parquet("events", opts, "data/events/year=*/month=*/*.parquet");

    var result = db.query("""
        SELECT year, month, COUNT(*) AS event_count
        FROM events
        GROUP BY year, month
        ORDER BY year, month
    """);
    System.out.println(result);
}

Loading Iceberg Table

import smile.data.SQL;

try (var db = new SQL()) {
    // The Iceberg extension must be installed and loaded first
    db.installExtension("iceberg");
    db.loadExtension("iceberg");

    // Load an Iceberg table
    db.iceberg("warehouse_data", "/data/iceberg/warehouse");

    var summary = db.query("SELECT product_category, SUM(revenue) AS total FROM warehouse_data GROUP BY product_category");
    System.out.println(summary);
}

Fluent Chaining: Loading Multiple Data Sources

import smile.data.SQL;

try (var db = new SQL()) {
    // Load multiple data sources in a fluent chain
    db.csv("customers", "data/customers.csv")
      .parquet("orders", "data/orders/*.parquet")
      .csv("products", "data/products.csv");

    // Join across the loaded tables
    var result = db.query("""
        SELECT c.name, p.product_name, o.quantity, o.total_price
        FROM orders o
        JOIN customers c ON o.customer_id = c.id
        JOIN products p ON o.product_id = p.id
        WHERE o.total_price > 100
        ORDER BY o.total_price DESC
        LIMIT 20
    """);
    System.out.println(result);
}
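Loading JSON Files

A sketch for the json() loader in the same style as the examples above, using the four-argument overload from the API signature. The file path data/events.ndjson and its field names are assumptions for illustration:

```java
import smile.data.SQL;
import java.util.Map;

try (var db = new SQL()) {
    // Newline-delimited JSON with an explicit column spec.
    // The path and fields here are hypothetical.
    Map<String, String> columns = Map.of(
        "user_id", "INTEGER",
        "event", "VARCHAR",
        "ts", "TIMESTAMP"
    );
    db.json("clicks", "newline_delimited", columns, "data/events.ndjson");

    var counts = db.query("""
        SELECT event, COUNT(*) AS n
        FROM clicks
        GROUP BY event
        ORDER BY n DESC
    """);
    System.out.println(counts);
}
```

With format "auto" and no column map, the simpler json(name, path...) overload lets DuckDB detect both the layout and the schema.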

Parquet Reader Options

The parquet() method with options supports the following DuckDB Parquet reader parameters:

binary_as_string (boolean): Load binary columns as strings (for legacy Parquet writers that do not set the UTF8 flag).
filename (boolean): Include an extra filename column with the source file path.
file_row_number (boolean): Include a file_row_number column with the row number within each file.
hive_partitioning (boolean): Interpret the directory structure as Hive partitioning (e.g., year=2024/month=01).
union_by_name (boolean): Unify columns from multiple files by name rather than by position.
