Implementation:Haifengl Smile DataFrame Numerical Conversion

From Leeroopedia


Overview

The DataFrame Numerical Conversion API provides methods on the DataFrame record for converting heterogeneous tabular data into homogeneous numerical arrays (double[][]) and matrices (DenseMatrix). These methods handle categorical encoding (level, dummy, one-hot), optional bias/intercept columns, missing value representation, and column selection -- producing the final numerical representation consumed by Smile's ML algorithms.

API Summary

Method | Return Type | Description
toArray(String... columns) | double[][] | Convert selected columns to a 2D array (level encoding, no bias)
toArray(boolean bias, CategoricalEncoder encoder, String... names) | double[][] | Convert with explicit bias and encoding options
toMatrix() | DenseMatrix | Convert all columns to a matrix (level encoding, no bias)
toMatrix(boolean bias, CategoricalEncoder encoder, String rowNames) | DenseMatrix | Convert with explicit bias, encoding, and row name column

Source Location

Property Value
File base/src/main/java/smile/data/DataFrame.java
Lines L742-925
Package smile.data
Repository github.com/haifengl/smile

Import

import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.tensor.DenseMatrix;

Type: API Doc

Signature

public record DataFrame(StructType schema, List<ValueVector> columns, RowIndex index)
        implements Iterable<Row>, Serializable {

    // Convert to 2D double array
    public double[][] toArray(String... columns)
    public double[][] toArray(boolean bias,
        CategoricalEncoder encoder, String... names)

    // Convert to DenseMatrix
    public DenseMatrix toMatrix()
    public DenseMatrix toMatrix(boolean bias,
        CategoricalEncoder encoder, String rowNames)
}

CategoricalEncoder Enum

The CategoricalEncoder enum defines three encoding strategies for categorical (nominal/ordinal) variables:

public enum CategoricalEncoder {
    /** Level encoding: integer index of the category level. */
    LEVEL,

    /** Dummy encoding: k-1 binary columns (reference = first level). */
    DUMMY,

    /** One-hot encoding: k binary columns. */
    ONE_HOT
}

Encoder | Columns per k-level Variable | Suitable For
LEVEL | 1 | Tree-based models, ordinal data
DUMMY | k-1 | Linear models with intercept (avoids the dummy variable trap)
ONE_HOT | k | Neural networks, regularized models without intercept
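
The column counts in the table determine the width of the converted array. The following is a minimal sketch of that arithmetic in plain Java. It does not call Smile; the class EncodedWidth, its nested Encoder enum, and both method names are hypothetical and exist only for illustration.

```java
/** Hypothetical helper (not part of Smile): predicts the width of an encoded row. */
final class EncodedWidth {
    /** Stand-in for smile.data.CategoricalEncoder, so the sketch has no dependencies. */
    enum Encoder { LEVEL, DUMMY, ONE_HOT }

    /** Output columns produced by one categorical variable with k levels. */
    static int columnsFor(Encoder encoder, int k) {
        switch (encoder) {
            case DUMMY:   return k - 1; // reference level is dropped
            case ONE_HOT: return k;     // one indicator per level
            default:      return 1;     // LEVEL: a single integer-coded column
        }
    }

    /** Total width: optional bias column + numeric columns + encoded categorical columns. */
    static int totalWidth(boolean bias, Encoder encoder,
                          int numericColumns, int... levelCounts) {
        int width = (bias ? 1 : 0) + numericColumns;
        for (int k : levelCounts) {
            width += columnsFor(encoder, k);
        }
        return width;
    }

    public static void main(String[] args) {
        // Four numeric features plus one 3-level categorical variable.
        System.out.println(totalWidth(true, Encoder.DUMMY, 4, 3));
        System.out.println(totalWidth(false, Encoder.ONE_HOT, 4, 3));
        System.out.println(totalWidth(false, Encoder.LEVEL, 4, 3));
    }
}
```

With four numeric features and one 3-level variable, dummy encoding with a bias column and one-hot encoding without one both give the same width, which is why checking X[0].length alone does not reveal which encoder was used.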

Inputs and Outputs

Parameter | Type | Description | Default
columns / names | String... | Column names to include (empty = all columns) | All columns
bias | boolean | If true, prepend a column of 1.0 values | false
encoder | CategoricalEncoder | Encoding strategy for categorical variables | LEVEL
rowNames | String | Column to use as row names in the matrix (excluded from data) | null
Returns | double[][] or DenseMatrix | The numerical representation | --

Usage Examples

Example 1: Simple conversion to double array

import smile.io.Read;
import smile.data.DataFrame;

DataFrame iris = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Select only numeric feature columns
DataFrame features = iris.select("sepal_length", "sepal_width",
    "petal_length", "petal_width");

// Convert all columns to a 2D array
double[][] X = features.toArray();
// X.length == 150 (rows), X[0].length == 4 (columns)

System.out.printf("Shape: %d x %d%n", X.length, X[0].length);

Example 2: Conversion with specific column selection

import smile.io.Read;
import smile.data.DataFrame;

DataFrame data = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Convert only two columns -- no need to select first
double[][] X = data.toArray("sepal_length", "petal_length");
// X[0].length == 2

System.out.printf("Selected features: %d x %d%n",
    X.length, X[0].length);

Example 3: Dummy encoding with bias term for linear regression

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;

DataFrame data = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Factorize the species column first (string -> integer + NominalScale)
DataFrame encoded = data.factorize("species");

// Convert with intercept column and dummy encoding
// species has 3 levels -> 2 dummy columns
double[][] X = encoded.toArray(true, CategoricalEncoder.DUMMY,
    "sepal_length", "sepal_width", "petal_length",
    "petal_width", "species");

// X[0].length == 1 (bias) + 4 (numeric) + 2 (dummy) = 7
System.out.printf("With bias and dummy: %d x %d%n",
    X.length, X[0].length);
// First column is all 1.0 (intercept)
System.out.println("Bias column: " + X[0][0]); // 1.0

Example 4: One-hot encoding for neural networks

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;

DataFrame data = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

DataFrame encoded = data.factorize("species");

// One-hot encoding: species (3 levels) -> 3 binary columns
double[][] X = encoded.toArray(false, CategoricalEncoder.ONE_HOT,
    "sepal_length", "sepal_width", "petal_length",
    "petal_width", "species");

// X[0].length == 4 (numeric) + 3 (one-hot) = 7
System.out.printf("With one-hot: %d x %d%n",
    X.length, X[0].length);

Example 5: Conversion to DenseMatrix

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.tensor.DenseMatrix;

DataFrame data = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Convert all numeric columns to a DenseMatrix
DataFrame numeric = data.select("sepal_length", "sepal_width",
    "petal_length", "petal_width");
DenseMatrix matrix = numeric.toMatrix();

System.out.printf("Matrix: %d x %d%n",
    matrix.nrow(), matrix.ncol());
// Matrix supports BLAS operations, named rows/columns, etc.

Example 6: DenseMatrix with row names and encoding

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.tensor.DenseMatrix;

DataFrame data = Read.csv("data/countries.csv",
    "delimiter=,,header=true");

DataFrame encoded = data.factorize("region");

// Use "country_name" column as row labels (excluded from data)
DenseMatrix matrix = encoded.toMatrix(true,
    CategoricalEncoder.DUMMY, "country_name");

// Print the column names of the encoded matrix
System.out.println("Column names: " +
    String.join(", ", matrix.colNames()));

Example 7: Full pipeline -- load, inspect, select, transform, convert

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.feature.transform.Standardizer;
import smile.data.transform.InvertibleColumnTransform;

// Stage 1: Load
DataFrame raw = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Stage 2: Inspect
System.out.println("Schema: " + raw.schema());
System.out.println("Shape: " + raw.nrow() + " x " + raw.ncol());

// Stage 3: Factorize the label column (features are selected by name in Stage 4)
DataFrame prepared = raw.factorize("species");

// Stage 4: Transform (standardize numeric features)
String[] featureCols = {"sepal_length", "sepal_width",
    "petal_length", "petal_width"};
InvertibleColumnTransform std = Standardizer.fit(prepared, featureCols);
DataFrame transformed = std.apply(prepared);

// Stage 5: Convert to numerical array
double[][] X = transformed.toArray(false,
    CategoricalEncoder.LEVEL, featureCols);
int[] y = transformed.column("species").intStream().toArray();

System.out.printf("X: %d x %d, y: %d%n",
    X.length, X[0].length, y.length);
// Ready for classification: RandomForest.fit(X, y, ...)

Implementation Details

toArray() Encoding Logic

The toArray() method iterates over the requested columns. For each column:

  1. If the column has a CategoricalMeasure and the encoder is not LEVEL (below, k is the number of levels, f is the row's factor index, and j is the offset of the variable's first output column):
    • DUMMY: Creates k-1 columns. For each row with f > 0, sets matrix[i][j + f - 1] = 1.0. Rows at the reference level (f = 0) leave all k-1 columns at 0.
    • ONE_HOT: Creates k columns. For each row, sets matrix[i][j + f] = 1.0.
  2. Otherwise: Copies the double value directly with column.getDouble(i).

Missing values are naturally represented as Double.NaN since getDouble() returns NaN for null entries.
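
The encoding steps and the NaN convention above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Smile's source: the class EncodingSketch and its encode method are hypothetical, the input is simplified to one numeric column (with null marking a missing value) and one categorical column already expressed as factor indices in [0, k).

```java
/** Hypothetical sketch (not Smile's code) of bias, NaN, dummy, and one-hot handling. */
final class EncodingSketch {
    /** Stand-in for smile.data.CategoricalEncoder, so the sketch has no dependencies. */
    enum Encoder { LEVEL, DUMMY, ONE_HOT }

    /**
     * Encodes one numeric column and one categorical column into rows of doubles.
     * numeric[i] == null marks a missing value; factor[i] is a factor index in [0, k).
     */
    static double[][] encode(boolean bias, Encoder encoder,
                             Double[] numeric, int[] factor, int k) {
        int n = numeric.length;
        int catWidth = encoder == Encoder.LEVEL ? 1
                     : encoder == Encoder.DUMMY ? k - 1 : k;
        int width = (bias ? 1 : 0) + 1 + catWidth;
        double[][] out = new double[n][width]; // Java zero-initializes the array

        for (int i = 0; i < n; i++) {
            int j = 0;
            if (bias) {
                out[i][j++] = 1.0;               // intercept column comes first
            }
            // Missing numeric values become NaN.
            out[i][j++] = (numeric[i] == null) ? Double.NaN : numeric[i];

            int f = factor[i];
            switch (encoder) {
                case DUMMY:
                    if (f > 0) out[i][j + f - 1] = 1.0; // reference level 0 stays all zeros
                    break;
                case ONE_HOT:
                    out[i][j + f] = 1.0;
                    break;
                default:
                    out[i][j] = f;               // LEVEL: store the index itself
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Double[] x = {5.1, null, 6.3};
        int[] species = {0, 1, 2};
        for (double[] row : encode(true, Encoder.DUMMY, x, species, 3)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

Because NaN propagates through most arithmetic, callers of the real API would typically impute or drop missing values before passing the array to a model.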

toMatrix() vs toArray()

toMatrix() produces a DenseMatrix (backed by Float64 scalar type via DenseMatrix.zeros(Float64, nrow, ncol)) with named columns and optional named rows. The encoding logic is identical to toArray(). The rowNames parameter specifies a column whose string values become row labels; that column is excluded from the numerical data.

Factor Mapping

Categorical encoding uses the CategoricalMeasure.factor(int value) method to map the stored integer value to its factor index. This accounts for potential gaps in the level encoding (e.g., if levels were merged or filtered).
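
To illustrate why a mapping step is needed when stored values have gaps, here is a small plain-Java sketch. It is not the CategoricalMeasure implementation; the class FactorMappingSketch is hypothetical and assumes factor indices follow the sorted order of the stored values.

```java
import java.util.Arrays;

/** Hypothetical sketch (not Smile's code): maps stored level values to dense indices. */
final class FactorMappingSketch {
    private final int[] values; // distinct stored level values, sorted ascending

    FactorMappingSketch(int... storedValues) {
        this.values = storedValues.clone();
        Arrays.sort(this.values);
    }

    /** Number of levels k. */
    int size() {
        return values.length;
    }

    /** Dense index in [0, k) for a stored value; throws if the value is not a known level. */
    int factor(int value) {
        int index = Arrays.binarySearch(values, value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown level value: " + value);
        }
        return index;
    }

    public static void main(String[] args) {
        // Stored values 0, 2, 5 (levels 1, 3, 4 were removed) map to indices 0, 1, 2.
        FactorMappingSketch mapping = new FactorMappingSketch(0, 2, 5);
        System.out.println(mapping.factor(5));
    }
}
```

Using the stored value 5 directly as a column offset would write past the three one-hot columns; the dense index keeps the offset within range.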

Metadata

Property Value
Type API Doc
Language Java
Library Version 5.2.0
Last Updated 2026-02-08 22:00 GMT
