Implementation:Haifengl Smile DataFrame Numerical Conversion

Overview

The DataFrame Numerical Conversion API provides methods on the DataFrame record for converting heterogeneous tabular data into homogeneous numerical arrays (double[][]) and matrices (DenseMatrix). These methods handle categorical encoding (level, dummy, one-hot), optional bias/intercept columns, missing value representation, and column selection -- producing the final numerical representation consumed by Smile's ML algorithms.

API Summary

Method	Return Type	Description
`toArray(String... columns)`	`double[][]`	Convert selected columns to a 2D array (level encoding, no bias)
`toArray(boolean bias, CategoricalEncoder encoder, String... names)`	`double[][]`	Convert with explicit bias and encoding options
`toMatrix()`	`DenseMatrix`	Convert all columns to a matrix (level encoding, no bias)
`toMatrix(boolean bias, CategoricalEncoder encoder, String rowNames)`	`DenseMatrix`	Convert with explicit bias, encoding, and row name column

Source Location

Property	Value
File	`base/src/main/java/smile/data/DataFrame.java`
Lines	L742-925
Package	`smile.data`
Repository	github.com/haifengl/smile

Import

import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.tensor.DenseMatrix;

Type: API Doc

Signature

public record DataFrame(StructType schema, List<ValueVector> columns, RowIndex index)
        implements Iterable<Row>, Serializable {

    // Convert to 2D double array
    public double[][] toArray(String... columns)
    public double[][] toArray(boolean bias,
        CategoricalEncoder encoder, String... names)

    // Convert to DenseMatrix
    public DenseMatrix toMatrix()
    public DenseMatrix toMatrix(boolean bias,
        CategoricalEncoder encoder, String rowNames)
}

CategoricalEncoder Enum

The CategoricalEncoder enum defines three encoding strategies for categorical (nominal/ordinal) variables:

public enum CategoricalEncoder {
    /** Level encoding: integer index of the category level. */
    LEVEL,

    /** Dummy encoding: k-1 binary columns (reference = first level). */
    DUMMY,

    /** One-hot encoding: k binary columns. */
    ONE_HOT
}

Encoder	Columns per k-level Variable	Suitable For
`LEVEL`	1	Tree-based models, ordinal data
`DUMMY`	$k - 1$	Linear models with intercept (avoids the dummy variable trap)
`ONE_HOT`	$k$	Neural networks, regularized models without intercept
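
The column counts in the table determine the width of the converted array. The following is a minimal sketch of that arithmetic, not Smile's code: the `SketchEncoder` enum and `EncodedWidth` class are hypothetical stand-ins that only mirror the three strategies described above.

```java
// Hypothetical stand-in for illustration; this is NOT smile.data.CategoricalEncoder.
enum SketchEncoder { LEVEL, DUMMY, ONE_HOT }

final class EncodedWidth {
    /** Number of output columns for one categorical variable with k levels. */
    static int columnsFor(SketchEncoder encoder, int k) {
        switch (encoder) {
            case DUMMY:   return k - 1; // reference level gets no column
            case ONE_HOT: return k;     // one indicator column per level
            default:      return 1;     // LEVEL keeps a single integer-coded column
        }
    }

    /** Total width: optional bias column + numeric columns + encoded categoricals. */
    static int width(boolean bias, int numericColumns,
                     SketchEncoder encoder, int... levelsPerCategorical) {
        int width = (bias ? 1 : 0) + numericColumns;
        for (int k : levelsPerCategorical) {
            width += columnsFor(encoder, k);
        }
        return width;
    }

    public static void main(String[] args) {
        // Iris-like layout: 4 numeric features and one 3-level categorical.
        System.out.println(width(true, 4, SketchEncoder.DUMMY, 3));    // 7
        System.out.println(width(false, 4, SketchEncoder.ONE_HOT, 3)); // 7
        System.out.println(width(false, 4, SketchEncoder.LEVEL, 3));   // 5
    }
}
```

With an intercept, DUMMY and bias together yield the same width as ONE_HOT without bias, which is why the two are usually treated as alternatives rather than combined.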

Inputs and Outputs

Parameter	Type	Description	Default
`columns` / `names`	`String...`	Column names to include (empty = all columns)	All columns
`bias`	`boolean`	If true, prepend a column of 1.0 values	`false`
`encoder`	`CategoricalEncoder`	Encoding strategy for categorical variables	`LEVEL`
`rowNames`	`String`	Column to use as row names in the matrix (excluded from data)	`null`
Returns	`double[][]` or `DenseMatrix`	The numerical representation	--

Usage Examples

Example 1: Simple conversion to double array

import smile.io.Read;
import smile.data.DataFrame;

DataFrame iris = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Select only numeric feature columns
DataFrame features = iris.select("sepal_length", "sepal_width",
    "petal_length", "petal_width");

// Convert all columns to a 2D array
double[][] X = features.toArray();
// X.length == 150 (rows), X[0].length == 4 (columns)

System.out.printf("Shape: %d x %d%n", X.length, X[0].length);

Example 2: Conversion with specific column selection

import smile.io.Read;
import smile.data.DataFrame;

DataFrame data = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Convert only two columns -- no need to select first
double[][] X = data.toArray("sepal_length", "petal_length");
// X[0].length == 2

System.out.printf("Selected features: %d x %d%n",
    X.length, X[0].length);

Example 3: Dummy encoding with bias term for linear regression

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;

DataFrame data = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Factorize the species column first (string -> integer + NominalScale)
DataFrame encoded = data.factorize("species");

// Convert with intercept column and dummy encoding
// species has 3 levels -> 2 dummy columns
double[][] X = encoded.toArray(true, CategoricalEncoder.DUMMY,
    "sepal_length", "sepal_width", "petal_length",
    "petal_width", "species");

// X[0].length == 1 (bias) + 4 (numeric) + 2 (dummy) = 7
System.out.printf("With bias and dummy: %d x %d%n",
    X.length, X[0].length);
// First column is all 1.0 (intercept)
System.out.println("Bias column: " + X[0][0]); // 1.0

Example 4: One-hot encoding for neural networks

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;

DataFrame data = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

DataFrame encoded = data.factorize("species");

// One-hot encoding: species (3 levels) -> 3 binary columns
double[][] X = encoded.toArray(false, CategoricalEncoder.ONE_HOT,
    "sepal_length", "sepal_width", "petal_length",
    "petal_width", "species");

// X[0].length == 4 (numeric) + 3 (one-hot) = 7
System.out.printf("With one-hot: %d x %d%n",
    X.length, X[0].length);

Example 5: Conversion to DenseMatrix

import smile.io.Read;
import smile.data.DataFrame;
import smile.tensor.DenseMatrix;

DataFrame data = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Convert all numeric columns to a DenseMatrix
DataFrame numeric = data.select("sepal_length", "sepal_width",
    "petal_length", "petal_width");
DenseMatrix matrix = numeric.toMatrix();

System.out.printf("Matrix: %d x %d%n",
    matrix.nrow(), matrix.ncol());
// Matrix supports BLAS operations, named rows/columns, etc.

Example 6: DenseMatrix with row names and encoding

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.tensor.DenseMatrix;

DataFrame data = Read.csv("data/countries.csv",
    "delimiter=,,header=true");

DataFrame encoded = data.factorize("region");

// Use "country_name" column as row labels (excluded from data)
DenseMatrix matrix = encoded.toMatrix(true,
    CategoricalEncoder.DUMMY, "country_name");

// Print the column names of the encoded matrix
System.out.println("Column names: " +
    String.join(", ", matrix.colNames()));

Example 7: Full pipeline -- load, inspect, select, transform, convert

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.feature.transform.Standardizer;
import smile.data.transform.InvertibleColumnTransform;

// Stage 1: Load
DataFrame raw = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Stage 2: Inspect
System.out.println("Schema: " + raw.schema());
System.out.println("Shape: " + raw.nrow() + " x " + raw.ncol());

// Stage 3: Factorize the label column (features are selected at conversion)
DataFrame prepared = raw.factorize("species");

// Stage 4: Transform (standardize numeric features)
String[] featureCols = {"sepal_length", "sepal_width",
    "petal_length", "petal_width"};
InvertibleColumnTransform std = Standardizer.fit(prepared, featureCols);
DataFrame transformed = std.apply(prepared);

// Stage 5: Convert to numerical array
double[][] X = transformed.toArray(false,
    CategoricalEncoder.LEVEL, featureCols);
int[] y = transformed.column("species").intStream().toArray();

System.out.printf("X: %d x %d, y: %d%n",
    X.length, X[0].length, y.length);
// Ready for classification: RandomForest.fit(X, y, ...)

Implementation Details

toArray() Encoding Logic

The toArray() method iterates over the requested columns. For each column:

If the column has a CategoricalMeasure and the encoder is not LEVEL:
- DUMMY: Creates $k - 1$ columns for a variable with $k$ levels. For each row, sets matrix[i][j + f - 1] = 1.0, where $j$ is the variable's first output column and $f$ is the row's factor index. Rows at the reference level ($f = 0$) are skipped and stay all zeros.
- ONE_HOT: Creates $k$ columns. For each row, sets matrix[i][j + f] = 1.0.
Otherwise: Copies the double value directly with column.getDouble(i).

Missing values are naturally represented as Double.NaN since getDouble() returns NaN for null entries.
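
The rules above can be sketched in plain Java without depending on Smile. This is an illustrative approximation under stated assumptions, not the library's implementation: the `EncodeSketch` class is hypothetical, it handles exactly one numeric column and one categorical column, a `null` entry stands in for a missing value, and `f` denotes a row's factor index in the range 0 to levels - 1.

```java
final class EncodeSketch {
    /**
     * Builds rows of [bias?] [numeric] [encoded categorical].
     * oneHot = true uses k indicator columns; false uses k - 1 (dummy encoding).
     */
    static double[][] toArray(boolean bias, boolean oneHot,
                              Double[] numeric, int[] factors, int levels) {
        int n = numeric.length;
        int encoded = oneHot ? levels : levels - 1;
        int width = (bias ? 1 : 0) + 1 + encoded;
        double[][] out = new double[n][width]; // zero-initialized by Java

        for (int i = 0; i < n; i++) {
            int j = 0;
            if (bias) out[i][j++] = 1.0;                         // intercept column first
            out[i][j++] = numeric[i] == null ? Double.NaN : numeric[i]; // missing -> NaN
            int f = factors[i];
            if (oneHot) {
                out[i][j + f] = 1.0;                             // one column per level
            } else if (f > 0) {
                out[i][j + f - 1] = 1.0;                         // level 0 is the reference
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] x = toArray(true, false,
            new Double[] {5.1, null, 6.3}, new int[] {0, 1, 2}, 3);
        for (double[] row : x) {
            System.out.println(java.util.Arrays.toString(row));
        }
        // [1.0, 5.1, 0.0, 0.0]
        // [1.0, NaN, 1.0, 0.0]
        // [1.0, 6.3, 0.0, 1.0]
    }
}
```

Because the output array starts zero-filled, only the single active indicator per row needs to be written, and the reference level under dummy encoding requires no write at all.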

toMatrix() vs toArray()

toMatrix() produces a DenseMatrix (backed by Float64 scalar type via DenseMatrix.zeros(Float64, nrow, ncol)) with named columns and optional named rows. The encoding logic is identical to toArray(). The rowNames parameter specifies a column whose string values become row labels; that column is excluded from the numerical data.

Factor Mapping

Categorical encoding uses the CategoricalMeasure.factor(int value) method to map the stored integer value to its factor index. This accounts for potential gaps in the level encoding (e.g., if levels were merged or filtered).
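
To illustrate why a value-to-index mapping matters when stored level values have gaps, here is a small sketch. It is an assumption-laden illustration, not the behavior of CategoricalMeasure itself: the `LevelIndex` class is hypothetical, and the choice to order indices by sorted stored value is made only for this example.

```java
final class LevelIndex {
    private final int[] values; // sorted, distinct stored level values

    LevelIndex(int... values) {
        this.values = values.clone();
        java.util.Arrays.sort(this.values);
    }

    /** Maps a stored level value to a contiguous index in [0, size()). */
    int factor(int value) {
        int idx = java.util.Arrays.binarySearch(values, value);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown level value: " + value);
        }
        return idx;
    }

    int size() {
        return values.length;
    }

    public static void main(String[] args) {
        // Stored values 2, 5, 9 have gaps; using them directly as column
        // offsets would need 10 indicator columns instead of 3.
        LevelIndex levels = new LevelIndex(2, 5, 9);
        System.out.println(levels.factor(2)); // 0
        System.out.println(levels.factor(5)); // 1
        System.out.println(levels.factor(9)); // 2
    }
}
```

Indexing indicator columns by the contiguous factor index rather than the raw stored value keeps the encoded width equal to the number of levels actually present.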

Related Pages

Principle:Haifengl_Smile_Numerical_Conversion

Metadata

Property	Value
Type	API Doc
Language	Java
Library Version	5.2.0
Last Updated	2026-02-08 22:00 GMT
