Implementation:Haifengl_Smile_SQL_Data_Loaders
Overview
The SQL Data Loaders are a set of methods on the smile.data.SQL class that create in-memory tables from external file sources. Each method generates a CREATE TABLE AS SELECT * FROM read_xxx(...) statement, delegating file parsing entirely to the DuckDB engine. All methods return this to support fluent method chaining.
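As an illustration of this delegation pattern, the statement a call like csv("iris", "data/iris.csv") hands to DuckDB can be assembled as below. The exact option formatting here is a sketch, not the actual code in SQL.java:

```java
// Illustration only: how a loader call maps to the SQL it delegates to DuckDB.
String name = "iris";
String path = "data/iris.csv";
String sql = "CREATE TABLE " + name
        + " AS SELECT * FROM read_csv(['" + path + "'], delim=',', header=true)";
System.out.println(sql);
```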
API Signature
public class SQL implements AutoCloseable {
    /**
     * Creates an in-memory table from CSV files with comma delimiter.
     * @param name the table name.
     * @param path the CSV file path(s) (supports glob patterns).
     * @return this object.
     */
    public SQL csv(String name, String... path) throws SQLException

    /**
     * Creates an in-memory table from CSV files with custom delimiter and columns.
     * @param name the table name.
     * @param delimiter the delimiter character.
     * @param columns optional map of column names to column types.
     * @param path the CSV file path(s).
     * @return this object.
     */
    public SQL csv(String name, char delimiter, Map<String, String> columns, String... path) throws SQLException

    /**
     * Creates an in-memory table from Parquet files.
     * @param name the table name.
     * @param path the Parquet file path(s) (supports glob patterns).
     * @return this object.
     */
    public SQL parquet(String name, String... path) throws SQLException

    /**
     * Creates an in-memory table from Parquet files with options.
     * @param name the table name.
     * @param options map of read options (e.g., hive_partitioning, union_by_name).
     * @param path the Parquet file path(s).
     * @return this object.
     */
    public SQL parquet(String name, Map<String, String> options, String... path) throws SQLException

    /**
     * Creates an in-memory table from an Iceberg table.
     * @param name the table name.
     * @param path the path to the Iceberg table folder.
     * @return this object.
     */
    public SQL iceberg(String name, String path) throws SQLException

    /**
     * Creates an in-memory table from an Iceberg table with moved-paths support.
     * @param name the table name.
     * @param path the path to the Iceberg table folder.
     * @param allowMovedPaths whether to allow scanning moved Iceberg tables.
     * @return this object.
     */
    public SQL iceberg(String name, String path, boolean allowMovedPaths) throws SQLException

    /**
     * Creates an in-memory table from JSON files.
     * @param name the table name.
     * @param path the JSON file path(s).
     * @return this object.
     */
    public SQL json(String name, String... path) throws SQLException

    /**
     * Creates an in-memory table from JSON files with format and column spec.
     * @param name the table name.
     * @param format "auto", "unstructured", "newline_delimited", or "array".
     * @param columns optional map of column names to types.
     * @param path the JSON file path(s).
     * @return this object.
     */
    public SQL json(String name, String format, Map<String, String> columns, String... path) throws SQLException
}
Source
base/src/main/java/smile/data/SQL.java:L159-343
Import
import smile.data.SQL;
import java.util.Map;
Type
API Doc
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| name | String | Yes | The name of the table to create in the database. |
| path | String... (varargs) | Yes | One or more file paths. Supports glob patterns (e.g., "data/*.csv"). Multiple files are unioned if they share the same schema. |
| delimiter | char | No (CSV only) | The column delimiter character. Defaults to ','. |
| columns | Map<String, String> | No | Explicit column specification mapping column names to DuckDB type names. |
| options | Map<String, String> | No (Parquet only) | Reader options such as binary_as_string, hive_partitioning, union_by_name. |
| allowMovedPaths | boolean | No (Iceberg only) | Whether to allow scanning Iceberg tables that have been relocated. |
| format | String | No (JSON only) | JSON format: "auto", "newline_delimited", "array", or "unstructured". |
Outputs
| Return Type | Description |
|---|---|
| SQL | Returns this for method chaining (fluent builder pattern). |
Each method creates a new table in the in-memory database as a side effect. The table persists for the lifetime of the SQL connection.
Internal Mechanism
Each loader method constructs a SQL string using DuckDB's native file-reading functions:
| Method | Generated SQL Pattern |
|---|---|
| csv() | CREATE TABLE {name} AS SELECT * FROM read_csv([paths], delim=',', header=true) |
| parquet() | CREATE TABLE {name} AS SELECT * FROM read_parquet([paths], options...) |
| iceberg() | CREATE TABLE {name} AS SELECT * FROM iceberg_scan('path', allow_moved_paths = false) |
| json() | CREATE TABLE {name} AS SELECT * FROM read_json([paths], format = 'auto') |
File paths are formatted into a DuckDB list literal (e.g., ['file1.csv', 'file2.csv']) via the private fileList() helper method. Column specifications are formatted into a DuckDB struct literal via the private columnList() helper.
All generated SQL statements are logged at INFO level via SLF4J before execution.
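The helper names fileList() and columnList() come from the text above; since they are private, the bodies below are a plausible re-creation of their output format rather than the actual SQL.java source:

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of the private fileList() helper: quote each path and
// join into a DuckDB list literal, e.g. ['file1.csv', 'file2.csv'].
String fileList(String... path) {
    StringJoiner sj = new StringJoiner(", ", "[", "]");
    for (String p : path) sj.add("'" + p + "'");
    return sj.toString();
}

// Hypothetical sketch of the private columnList() helper: format the column
// spec as a DuckDB struct literal, e.g. {'id': 'INTEGER'}.
String columnList(Map<String, String> columns) {
    StringJoiner sj = new StringJoiner(", ", "{", "}");
    columns.forEach((col, type) -> sj.add("'" + col + "': '" + type + "'"));
    return sj.toString();
}

System.out.println(fileList("data/a.csv", "data/b.csv"));
```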
Usage Examples
Loading a Single CSV File
import smile.data.SQL;
import smile.data.DataFrame;
try (var db = new SQL()) {
    // Load a CSV file with auto-detected schema
    db.csv("iris", "data/iris.csv");

    // Query the loaded table
    DataFrame df = db.query("SELECT species, AVG(sepal_length) AS avg_sl FROM iris GROUP BY species");
    System.out.println(df);
}
Loading Multiple CSV Files with Glob Pattern
import smile.data.SQL;
import smile.data.DataFrame;
try (var db = new SQL()) {
    // Load all CSV files matching the glob pattern into one table
    db.csv("logs", "data/logs_2024_*.csv");

    DataFrame count = db.query("SELECT COUNT(*) AS total_rows FROM logs");
    System.out.println(count);
}
Loading CSV with Custom Delimiter and Explicit Schema
import smile.data.SQL;
import java.util.Map;
try (var db = new SQL()) {
    // Tab-delimited file with explicit column types
    Map<String, String> columns = Map.of(
            "id", "INTEGER",
            "name", "VARCHAR",
            "score", "DOUBLE",
            "active", "BOOLEAN");
    db.csv("users", '\t', columns, "data/users.tsv");
    System.out.println(db.describe("users"));
}
Loading Parquet Files with Hive Partitioning
import smile.data.SQL;
import java.util.Map;
try (var db = new SQL()) {
    // Read Hive-partitioned Parquet files
    Map<String, String> opts = Map.of("hive_partitioning", "true");
    db.parquet("events", opts, "data/events/year=*/month=*/*.parquet");

    var result = db.query("""
            SELECT year, month, COUNT(*) AS event_count
            FROM events
            GROUP BY year, month
            ORDER BY year, month
            """);
    System.out.println(result);
}
Loading Iceberg Table
import smile.data.SQL;
try (var db = new SQL()) {
    // The Iceberg extension must be installed and loaded first
    db.installExtension("iceberg");
    db.loadExtension("iceberg");

    // Load an Iceberg table
    db.iceberg("warehouse_data", "/data/iceberg/warehouse");

    var summary = db.query("SELECT product_category, SUM(revenue) AS total FROM warehouse_data GROUP BY product_category");
    System.out.println(summary);
}
Fluent Chaining: Loading Multiple Data Sources
import smile.data.SQL;
try (var db = new SQL()) {
    // Load multiple data sources in a fluent chain
    db.csv("customers", "data/customers.csv")
      .parquet("orders", "data/orders/*.parquet")
      .csv("products", "data/products.csv");

    // Join across the loaded tables
    var result = db.query("""
            SELECT c.name, p.product_name, o.quantity, o.total_price
            FROM orders o
            JOIN customers c ON o.customer_id = c.id
            JOIN products p ON o.product_id = p.id
            WHERE o.total_price > 100
            ORDER BY o.total_price DESC
            LIMIT 20
            """);
    System.out.println(result);
}
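Loading Newline-Delimited JSON
The JSON loaders documented above have no example here, so the following sketch mirrors the other examples; the file path, column names, and query are illustrative, not taken from the Smile sources.

```java
import smile.data.SQL;
import java.util.Map;

// Explicit column types for a newline-delimited JSON file (one object per line).
// The path and column names are hypothetical.
Map<String, String> columns = Map.of(
        "user_id", "INTEGER",
        "action", "VARCHAR",
        "ts", "TIMESTAMP");

try (var db = new SQL()) {
    db.json("events", "newline_delimited", columns, "data/events.ndjson");

    var top = db.query("SELECT action, COUNT(*) AS n FROM events GROUP BY action ORDER BY n DESC");
    System.out.println(top);
}
```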
Parquet Reader Options
The parquet() method with options supports the following DuckDB Parquet reader parameters:
| Option | Type | Description |
|---|---|---|
| binary_as_string | boolean | Load binary columns as strings (for legacy Parquet writers that do not set the UTF8 flag). |
| filename | boolean | Include an extra filename column with the source file path. |
| file_row_number | boolean | Include a file_row_number column with the row number within each file. |
| hive_partitioning | boolean | Interpret the directory structure as Hive partitioning (e.g., year=2024/month=01). |
| union_by_name | boolean | Unify columns from multiple files by name rather than by position. |
Related
- Principle:Haifengl_Smile_SQL_Data_Ingestion - The theoretical basis for direct file-to-table ingestion.
- Implementation:Haifengl_Smile_SQL_Constructor - The SQL instance must be constructed before data can be loaded.
- Implementation:Haifengl_Smile_SQL_Metadata_API - After loading, metadata methods can inspect the resulting table schemas.
- Implementation:Haifengl_Smile_SQL_Query_API - Loaded tables are queried via query(), execute(), and update().