Implementation:Haifengl_Smile_DataFrame_To_Numerical_For_SQL
Overview
The DataFrame-to-numerical conversion methods on smile.data.DataFrame convert SQL query results into ML-ready numerical arrays (double[][]) or dense matrices (DenseMatrix). This page documents the same toArray() and toMatrix() API used for general numerical conversion, but viewed from the SQL analytics pipeline perspective, where DataFrames originate from SQL.query() rather than from file readers.
API Signature
public record DataFrame(StructType schema, List<ValueVector> columns, RowIndex index)
        implements Iterable<Row>, Serializable {

    /**
     * Converts specified columns to a double array. Uses LEVEL encoding
     * for categorical variables, no bias term.
     * @param columns the columns to export. If empty, all columns are used.
     * @return the numeric array.
     */
    public double[][] toArray(String... columns)

    /**
     * Converts specified columns to a double array with encoding options.
     * @param bias if true, prepend a column of all 1's (intercept term).
     * @param encoder the categorical variable encoder (LEVEL, DUMMY, ONE_HOT).
     * @param names the columns to export. If empty, all columns are used.
     * @return the numeric array.
     */
    public double[][] toArray(boolean bias, CategoricalEncoder encoder, String... names)

    /**
     * Converts all columns to a DenseMatrix. Uses LEVEL encoding,
     * no bias term, no row names.
     * @return the numeric matrix.
     */
    public DenseMatrix toMatrix()

    /**
     * Converts all columns to a DenseMatrix with encoding options.
     * @param bias if true, prepend a column of all 1's (intercept term).
     * @param encoder the categorical variable encoder (LEVEL, DUMMY, ONE_HOT).
     * @param rowNames the column to use as row names (excluded from data). Can be null.
     * @return the numeric matrix.
     */
    public DenseMatrix toMatrix(boolean bias, CategoricalEncoder encoder, String rowNames)
}
Source
base/src/main/java/smile/data/DataFrame.java:L742-925
Import
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.tensor.DenseMatrix;
Type
API Doc (angle: SQL results to ML)
Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| columns / names | String... | No | All columns | Specific columns to include in the output. If empty, all columns are converted. |
| bias | boolean | No | false | If true, prepends a column of all 1.0 values (intercept/bias term). |
| encoder | CategoricalEncoder | No | LEVEL | How to encode categorical variables: LEVEL, DUMMY, or ONE_HOT. |
| rowNames | String | No | null | Column to use as row names in DenseMatrix output (the column is excluded from the data). |
The input is always the DataFrame instance itself, typically obtained from SQL.query().
Outputs
| Method | Return Type | Memory Layout | Description |
|---|---|---|---|
| toArray() | double[][] | Row-major | Each row is a double[] of feature values. Suitable for general ML input. |
| toMatrix() | DenseMatrix | Column-major | A Smile DenseMatrix (Float64 type) with optional row/column names. Optimized for linear algebra. |
Both methods encode missing values/nulls as Double.NaN.
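The row-major versus column-major distinction matters when interoperating with linear algebra code. A plain-Java sketch (no Smile dependency) of how the same element (row i, column j) is addressed in each layout:

```java
public class LayoutDemo {
    public static void main(String[] args) {
        int nrow = 3, ncol = 2;
        double[][] rowMajor = new double[nrow][ncol]; // toArray(): rowMajor[i] is one sample
        double[] colMajor = new double[nrow * ncol];  // flat column-major storage, as in a dense matrix

        // store the same value at row 2, column 1 in both layouts
        rowMajor[2][1] = 42.0;
        colMajor[1 * nrow + 2] = 42.0; // column-major: flat index = j * nrow + i

        System.out.println(rowMajor[2][1] == colMajor[1 * nrow + 2]);
    }
}
```

The `j * nrow + i` indexing is the standard column-major convention (as in Fortran/BLAS-backed matrix libraries); consecutive elements of one column are adjacent in memory.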
CategoricalEncoder Values
The CategoricalEncoder enum defines three encoding strategies:
public enum CategoricalEncoder {
LEVEL, // Use integer level value directly (1 column per feature)
DUMMY, // k-1 binary columns, dropping reference level
ONE_HOT // k binary columns, one per category level
}
| Encoder | Columns Generated | Linear Model Safe | Description |
|---|---|---|---|
| LEVEL | 1 | Yes (ordinal assumption) | Direct integer encoding. Assumes an ordinal relationship. |
| DUMMY | k - 1 | Yes (full rank) | Drops the first category as reference. Avoids multicollinearity. |
| ONE_HOT | k | No (multicollinear with intercept) | Full binary expansion. Use without bias or with regularization. |
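The three strategies can be illustrated without Smile at all. The helper methods below are hypothetical, written only to show what each encoder emits for a categorical value with k levels (level index in [0, k)):

```java
import java.util.Arrays;

public class EncoderDemo {
    // Hypothetical helpers mirroring the three encoding strategies.
    static double[] level(int levelIndex) {              // 1 column: the raw level
        return new double[] { levelIndex };
    }
    static double[] oneHot(int levelIndex, int k) {      // k binary columns
        double[] v = new double[k];
        v[levelIndex] = 1.0;
        return v;
    }
    static double[] dummy(int levelIndex, int k) {       // k-1 columns; level 0 is the reference
        double[] v = new double[k - 1];
        if (levelIndex > 0) v[levelIndex - 1] = 1.0;
        return v;
    }

    public static void main(String[] args) {
        int k = 3;  // e.g. a 3-level categorical feature
        System.out.println(Arrays.toString(level(2)));     // [2.0]
        System.out.println(Arrays.toString(dummy(0, k)));  // [0.0, 0.0]  (reference level: all zeros)
        System.out.println(Arrays.toString(dummy(2, k)));  // [0.0, 1.0]
        System.out.println(Arrays.toString(oneHot(2, k))); // [0.0, 0.0, 1.0]
    }
}
```

Note that the dummy row for the reference level is all zeros, which is exactly why DUMMY stays full-rank when a bias column is present while ONE_HOT does not.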
Internal Mechanism
The conversion process iterates through each specified column:
- If the column has a CategoricalMeasure and encoder is not LEVEL:
- For DUMMY encoding: Creates k-1 binary columns, setting 1.0 at the position corresponding to the category factor (skipping index 0).
- For ONE_HOT encoding: Creates k binary columns, setting 1.0 at the position corresponding to the category factor.
- Otherwise (numeric columns or LEVEL encoding):
  - Calls column.getDouble(i) for each row, which performs automatic type coercion from the column's native type to double.
- If bias is true: Prepends a column of all 1.0 values at index 0.
For toMatrix(), the output is a DenseMatrix created via DenseMatrix.zeros(Float64, nrow, ncol) and populated element-by-element. Column names are set from the original DataFrame field names (with categorical expansions like "color_red"). Row names are optionally extracted from a specified column.
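The conversion loop described above can be sketched in plain Java. This is a minimal illustration, not Smile's implementation: it assumes one numeric column and one categorical column with k levels, applies DUMMY expansion, and optionally prepends the bias column:

```java
import java.util.Arrays;

public class ToArraySketch {
    // Minimal sketch of the conversion loop: the numeric column is copied,
    // the categorical column is DUMMY-expanded (k-1 columns, reference level 0
    // stays all zeros), and an optional bias column of 1.0 is prepended.
    static double[][] toArray(double[] numeric, int[] categorical, int k, boolean bias) {
        int n = numeric.length;
        int ncol = (bias ? 1 : 0) + 1 + (k - 1);   // bias + numeric + k-1 dummy columns
        double[][] X = new double[n][ncol];
        for (int i = 0; i < n; i++) {
            int j = 0;
            if (bias) X[i][j++] = 1.0;             // intercept column at index 0
            X[i][j++] = numeric[i];                // numeric value, already a double
            if (categorical[i] > 0)                // DUMMY: skip reference level 0
                X[i][j + categorical[i] - 1] = 1.0;
        }
        return X;
    }

    public static void main(String[] args) {
        // two rows: categorical levels 0 (reference) and 2, out of k = 3
        double[][] X = toArray(new double[] {1.5, 2.5}, new int[] {0, 2}, 3, true);
        System.out.println(Arrays.deepToString(X));
        // [[1.0, 1.5, 0.0, 0.0], [1.0, 2.5, 0.0, 1.0]]
    }
}
```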
Usage Examples
Basic SQL-to-Array Pipeline
import smile.data.SQL;
import smile.data.DataFrame;
try (var db = new SQL()) {
    db.csv("iris", "data/iris.csv");

    // Query numeric features only
    DataFrame features = db.query("""
        SELECT sepal_length, sepal_width, petal_length, petal_width
        FROM iris
        """);

    // Convert to double array for ML
    double[][] X = features.toArray();
    System.out.println("Shape: " + X.length + " x " + X[0].length);
    // Output: Shape: 150 x 4
}
SQL Query to Feature Matrix with Categorical Encoding
import smile.data.SQL;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
try (var db = new SQL()) {
    db.csv("housing", "data/housing.csv");

    // Query returns a mix of numeric and categorical columns
    DataFrame raw = db.query("""
        SELECT square_feet, bedrooms, neighborhood, has_garage, price
        FROM housing
        """);

    // Factorize categorical string columns into integer-encoded columns
    DataFrame encoded = raw.factorize("neighborhood");

    // Separate features and target
    DataFrame featureDf = encoded.select("square_feet", "bedrooms", "neighborhood", "has_garage");
    double[] y = encoded.column("price").doubleStream().toArray();

    // Convert features with DUMMY encoding and a bias term for linear regression
    double[][] X = featureDf.toArray(true, CategoricalEncoder.DUMMY);
    System.out.println("Feature matrix: " + X.length + " x " + X[0].length);
    System.out.println("First row intercept (bias): " + X[0][0]); // 1.0
}
SQL Aggregation Results to DenseMatrix
import smile.data.SQL;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.tensor.DenseMatrix;
try (var db = new SQL()) {
    db.parquet("metrics", "data/monthly_metrics.parquet");

    // Compute a correlation-ready matrix from a SQL aggregation
    DataFrame monthly = db.query("""
        SELECT month,
               AVG(revenue) AS avg_revenue,
               AVG(cost) AS avg_cost,
               AVG(customer_count) AS avg_customers,
               AVG(churn_rate) AS avg_churn
        FROM metrics
        GROUP BY month
        ORDER BY month
        """);

    // Convert to DenseMatrix with month as row names
    DenseMatrix matrix = monthly.toMatrix(false, CategoricalEncoder.LEVEL, "month");
    System.out.println("Matrix dimensions: " + matrix.nrow() + " x " + matrix.ncol());
    // Column names are preserved: avg_revenue, avg_cost, avg_customers, avg_churn
}
End-to-End SQL Analytics to Classification Pipeline
import smile.data.SQL;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.classification.RandomForest;
try (var db = new SQL()) {
    // Load and prepare data via SQL
    db.csv("dataset", "data/credit_data.csv");
    DataFrame trainData = db.query("""
        SELECT age, income, debt_ratio, num_accounts,
               credit_score, employment_years, default_flag
        FROM dataset
        WHERE split = 'train'
        """);

    // Extract features and labels
    double[][] X = trainData.select("age", "income", "debt_ratio",
                                    "num_accounts", "credit_score",
                                    "employment_years").toArray();
    int[] y = trainData.column("default_flag").intStream().toArray();

    // Train a classifier
    var model = RandomForest.fit(smile.data.formula.Formula.lhs("default_flag"), trainData);
    System.out.println("Model trained on " + X.length + " samples with " + X[0].length + " features");

    // Fetch test data via SQL query for scoring
    DataFrame testData = db.query("""
        SELECT age, income, debt_ratio, num_accounts,
               credit_score, employment_years
        FROM dataset
        WHERE split = 'test'
        """);
    double[][] Xtest = testData.toArray();
    System.out.println("Test set: " + Xtest.length + " samples");
}
Handling NULL Values from SQL Results
import smile.data.SQL;
import smile.data.DataFrame;
try (var db = new SQL()) {
    db.csv("survey", "data/survey_results.csv");

    // Query may contain NULLs
    DataFrame raw = db.query("SELECT age, income, satisfaction_score FROM survey");

    // Option 1: Drop rows with nulls before conversion
    DataFrame clean = raw.dropna();
    double[][] X1 = clean.toArray();
    System.out.println("After dropna: " + X1.length + " rows");

    // Option 2: Fill nulls with a default value before conversion
    raw.fillna(0.0);
    double[][] X2 = raw.toArray();
    System.out.println("After fillna: " + X2.length + " rows");

    // Option 3: Convert directly (NULLs become Double.NaN)
    DataFrame raw2 = db.query("SELECT age, income, satisfaction_score FROM survey");
    double[][] X3 = raw2.toArray();
    // NaN values present -- handle during model training or preprocessing
}
Key Behaviors
- Column selection: When column names are specified, only those columns are included in the output. When omitted, all columns are converted.
- NULL mapping: SQL NULL values are encoded as Double.NaN in the output array/matrix.
- Categorical expansion: With DUMMY or ONE_HOT encoding, the output matrix may have more columns than the input DataFrame due to binary expansion of categorical variables.
- Column naming in DenseMatrix: Expanded categorical columns are named "original_name_level_value" (e.g., "neighborhood_downtown").
- Bias column naming: When bias=true, the first column is named "Intercept".
- Matrix type: toMatrix() creates a Float64 DenseMatrix (DenseMatrix.zeros(Float64, nrow, ncol)).
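The output column count implied by the behaviors above can be predicted before conversion. A small plain-Java sketch (the formula is an assumption derived from the encoder table, not a Smile API):

```java
public class ColumnCount {
    // expected output columns for one categorical feature with k levels
    static int expandedColumns(String encoder, int k) {
        return switch (encoder) {
            case "LEVEL"   -> 1;
            case "DUMMY"   -> k - 1;
            case "ONE_HOT" -> k;
            default -> throw new IllegalArgumentException(encoder);
        };
    }

    public static void main(String[] args) {
        int numericCols = 3, k = 5;   // 3 numeric features plus one 5-level categorical
        boolean bias = true;
        int ncol = (bias ? 1 : 0) + numericCols + expandedColumns("DUMMY", k);
        System.out.println(ncol); // 1 bias + 3 numeric + 4 dummy = 8
    }
}
```

Precomputing this is useful for sanity-checking X[0].length after toArray(), or for sizing buffers before copying into external linear algebra libraries.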
Related
- Principle:Haifengl_Smile_SQL_to_ML_Bridge - The theoretical basis for converting relational data to numerical arrays.
- Implementation:Haifengl_Smile_SQL_Query_API - The DataFrame inputs typically come from SQL.query().
- Implementation:Haifengl_Smile_SQL_Constructor - The full pipeline begins with SQL connection construction.
- Implementation:Haifengl_Smile_SQL_Data_Loaders - Data is loaded into the SQL engine before querying and conversion.