Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Implementation:Haifengl Smile DataFrame To Numerical For SQL

From Leeroopedia


Overview

The DataFrame-to-Numerical conversion methods on smile.data.DataFrame convert SQL query results into ML-ready numerical arrays (double[][]) or dense matrices (DenseMatrix). This is an angle-conditioned documentation of the same toArray() and toMatrix() API used for general numerical conversion, but viewed specifically from the SQL analytics pipeline perspective -- where DataFrames originate from SQL.query() rather than file readers.

API Signature

public record DataFrame(StructType schema, List<ValueVector> columns, RowIndex index)
        implements Iterable<Row>, Serializable {

    /**
     * Converts specified columns to a double array. Uses LEVEL encoding
     * for categorical variables, no bias term.
     * @param columns the columns to export. If empty, all columns are used.
     * @return the numeric array.
     */
    public double[][] toArray(String... columns)

    /**
     * Converts specified columns to a double array with encoding options.
     * @param bias if true, prepend a column of all 1's (intercept term).
     * @param encoder the categorical variable encoder (LEVEL, DUMMY, ONE_HOT).
     * @param names the columns to export. If empty, all columns are used.
     * @return the numeric array.
     */
    public double[][] toArray(boolean bias, CategoricalEncoder encoder, String... names)

    /**
     * Converts all columns to a DenseMatrix. Uses LEVEL encoding,
     * no bias term, no row names.
     * @return the numeric matrix.
     */
    public DenseMatrix toMatrix()

    /**
     * Converts all columns to a DenseMatrix with encoding options.
     * @param bias if true, prepend a column of all 1's (intercept term).
     * @param encoder the categorical variable encoder (LEVEL, DUMMY, ONE_HOT).
     * @param rowNames the column to use as row names (excluded from data). Can be null.
     * @return the numeric matrix.
     */
    public DenseMatrix toMatrix(boolean bias, CategoricalEncoder encoder, String rowNames)
}

Source

base/src/main/java/smile/data/DataFrame.java:L742-925

Import

import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.tensor.DenseMatrix;

Type

API Doc (angle: SQL results to ML)

Inputs

Parameter Type Required Default Description
columns / names String... No All columns Specific columns to include in the output. If empty, all columns are converted.
bias boolean No false If true, prepends a column of all 1.0 values (intercept/bias term).
encoder CategoricalEncoder No LEVEL How to encode categorical variables: LEVEL, DUMMY, or ONE_HOT.
rowNames String No null Column to use as row names in DenseMatrix output (column is excluded from data).

The input is always the DataFrame instance itself, typically obtained from SQL.query().

Outputs

Method Return Type Memory Layout Description
toArray() double[][] Row-major Each row is a double[] of feature values. Suitable for general ML input.
toMatrix() DenseMatrix Column-major A Smile DenseMatrix (Float64 type) with optional row/column names. Optimized for linear algebra.

Both methods encode missing values/nulls as Double.NaN.

CategoricalEncoder Values

The CategoricalEncoder enum defines three encoding strategies:

public enum CategoricalEncoder {
    LEVEL,    // Use integer level value directly (1 column per feature)
    DUMMY,    // k-1 binary columns, dropping reference level
    ONE_HOT   // k binary columns, one per category level
}
Encoder Columns Generated Linear Model Safe Description
LEVEL 1 Yes (ordinal assumption) Direct integer encoding. Assumes ordinal relationship.
DUMMY k - 1 Yes (full rank) Drops first category as reference. Avoids multicollinearity.
ONE_HOT k No (multicollinear with intercept) Full binary expansion. Use without bias or with regularization.

Internal Mechanism

The conversion process iterates through each specified column:

  1. If the column has a CategoricalMeasure and encoder is not LEVEL:
    • For DUMMY encoding: Creates k-1 binary columns, setting 1.0 at the position corresponding to the category factor (skipping index 0).
    • For ONE_HOT encoding: Creates k binary columns, setting 1.0 at the position corresponding to the category factor.
  2. Otherwise (numeric columns or LEVEL encoding):
    • Calls column.getDouble(i) for each row, which performs automatic type coercion from the column's native type to double.
  3. If bias is true: Prepends a column of all 1.0 values at index 0.

For toMatrix(), the output is a DenseMatrix created via DenseMatrix.zeros(Float64, nrow, ncol) and populated element-by-element. Column names are set from the original DataFrame field names (with categorical expansions like "color_red"). Row names are optionally extracted from a specified column.

Usage Examples

Basic SQL-to-Array Pipeline

import smile.data.SQL;
import smile.data.DataFrame;

try (var db = new SQL()) {
    db.csv("iris", "data/iris.csv");

    // Query numeric features only
    DataFrame features = db.query("""
        SELECT sepal_length, sepal_width, petal_length, petal_width
        FROM iris
    """);

    // Convert to double array for ML
    double[][] X = features.toArray();
    System.out.println("Shape: " + X.length + " x " + X[0].length);
    // Output: Shape: 150 x 4
}

SQL Query to Feature Matrix with Categorical Encoding

import smile.data.SQL;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;

try (var db = new SQL()) {
    db.csv("housing", "data/housing.csv");

    // Query returns a mix of numeric and categorical columns
    DataFrame raw = db.query("""
        SELECT square_feet, bedrooms, neighborhood, has_garage, price
        FROM housing
    """);

    // Factorize categorical string columns to integer-encoded columns
    DataFrame encoded = raw.factorize("neighborhood");

    // Separate features and target
    DataFrame featureDf = encoded.select("square_feet", "bedrooms", "neighborhood", "has_garage");
    double[] y = encoded.column("price").doubleStream().toArray();

    // Convert features with DUMMY encoding and bias term for linear regression
    double[][] X = featureDf.toArray(true, CategoricalEncoder.DUMMY);

    System.out.println("Feature matrix: " + X.length + " x " + X[0].length);
    System.out.println("First row intercept (bias): " + X[0][0]); // 1.0
}

SQL Aggregation Results to DenseMatrix

import smile.data.SQL;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.tensor.DenseMatrix;

try (var db = new SQL()) {
    db.parquet("metrics", "data/monthly_metrics.parquet");

    // Compute a correlation-ready matrix from SQL aggregation
    DataFrame monthly = db.query("""
        SELECT month,
               AVG(revenue) AS avg_revenue,
               AVG(cost) AS avg_cost,
               AVG(customer_count) AS avg_customers,
               AVG(churn_rate) AS avg_churn
        FROM metrics
        GROUP BY month
        ORDER BY month
    """);

    // Convert to DenseMatrix with month as row names
    DenseMatrix matrix = monthly.toMatrix(false, CategoricalEncoder.LEVEL, "month");
    System.out.println("Matrix dimensions: " + matrix.nrow() + " x " + matrix.ncol());
    // Column names are preserved: avg_revenue, avg_cost, avg_customers, avg_churn
}

End-to-End SQL Analytics to Classification Pipeline

import smile.data.SQL;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.classification.RandomForest;

try (var db = new SQL()) {
    // Load and prepare data via SQL
    db.csv("dataset", "data/credit_data.csv");

    DataFrame trainData = db.query("""
        SELECT age, income, debt_ratio, num_accounts,
               credit_score, employment_years, default_flag
        FROM dataset
        WHERE split = 'train'
    """);

    // Extract features and labels
    double[][] X = trainData.select("age", "income", "debt_ratio",
                                     "num_accounts", "credit_score",
                                     "employment_years").toArray();
    int[] y = trainData.column("default_flag").intStream().toArray();

    // Train a classifier
    var model = RandomForest.fit(smile.data.formula.Formula.lhs("default_flag"), trainData);
    System.out.println("Model trained on " + X.length + " samples with " + X[0].length + " features");

    // Score test data via SQL query
    DataFrame testData = db.query("""
        SELECT age, income, debt_ratio, num_accounts,
               credit_score, employment_years
        FROM dataset
        WHERE split = 'test'
    """);

    double[][] Xtest = testData.toArray();
    System.out.println("Test set: " + Xtest.length + " samples");
}

Handling NULL Values from SQL Results

import smile.data.SQL;
import smile.data.DataFrame;

try (var db = new SQL()) {
    db.csv("survey", "data/survey_results.csv");

    // Query may contain NULLs
    DataFrame raw = db.query("SELECT age, income, satisfaction_score FROM survey");

    // Option 1: Drop rows with nulls before conversion
    DataFrame clean = raw.dropna();
    double[][] X1 = clean.toArray();
    System.out.println("After dropna: " + X1.length + " rows");

    // Option 2: Fill nulls with a default value before conversion
    raw.fillna(0.0);
    double[][] X2 = raw.toArray();
    System.out.println("After fillna: " + X2.length + " rows");

    // Option 3: Convert directly (NULLs become Double.NaN)
    DataFrame raw2 = db.query("SELECT age, income, satisfaction_score FROM survey");
    double[][] X3 = raw2.toArray();
    // NaN values present -- handle during model training or preprocessing
}

Key Behaviors

  • Column selection: When column names are specified, only those columns are included in the output. When omitted, all columns are converted.
  • NULL mapping: SQL NULL values are encoded as Double.NaN in the output array/matrix.
  • Categorical expansion: With DUMMY or ONE_HOT encoding, the output matrix may have more columns than the input DataFrame due to binary expansion of categorical variables.
  • Column naming in DenseMatrix: Expanded categorical columns are named as "original_name_level_value" (e.g., "neighborhood_downtown").
  • Bias column naming: When bias=true, the first column is named "Intercept".
  • Matrix type: toMatrix() creates a Float64 DenseMatrix (DenseMatrix.zeros(Float64, nrow, ncol)).

Related

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment