Implementation:Haifengl Smile DataFrame Numerical Conversion
Overview
The DataFrame Numerical Conversion API provides methods on the DataFrame record for converting heterogeneous tabular data into homogeneous numerical arrays (double[][]) and matrices (DenseMatrix). These methods handle categorical encoding (level, dummy, one-hot), optional bias/intercept columns, missing value representation, and column selection -- producing the final numerical representation consumed by Smile's ML algorithms.
API Summary
| Method | Return Type | Description |
|---|---|---|
| toArray(String... columns) | double[][] | Convert selected columns to a 2D array (level encoding, no bias) |
| toArray(boolean bias, CategoricalEncoder encoder, String... names) | double[][] | Convert with explicit bias and encoding options |
| toMatrix() | DenseMatrix | Convert all columns to a matrix (level encoding, no bias) |
| toMatrix(boolean bias, CategoricalEncoder encoder, String rowNames) | DenseMatrix | Convert with explicit bias, encoding, and row name column |
Source Location
| Property | Value |
|---|---|
| File | base/src/main/java/smile/data/DataFrame.java |
| Lines | L742-925 |
| Package | smile.data |
| Repository | github.com/haifengl/smile |
Import
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.tensor.DenseMatrix;
Type: API Doc
Signature
public record DataFrame(StructType schema, List<ValueVector> columns, RowIndex index)
implements Iterable<Row>, Serializable {
// Convert to 2D double array
public double[][] toArray(String... columns)
public double[][] toArray(boolean bias,
CategoricalEncoder encoder, String... names)
// Convert to DenseMatrix
public DenseMatrix toMatrix()
public DenseMatrix toMatrix(boolean bias,
CategoricalEncoder encoder, String rowNames)
}
CategoricalEncoder Enum
The CategoricalEncoder enum defines three encoding strategies for categorical (nominal/ordinal) variables:
public enum CategoricalEncoder {
/** Level encoding: integer index of the category level. */
LEVEL,
/** Dummy encoding: k-1 binary columns (reference = first level). */
DUMMY,
/** One-hot encoding: k binary columns. */
ONE_HOT
}
| Encoder | Columns per k-level Variable | Suitable For |
|---|---|---|
| LEVEL | 1 | Tree-based models, ordinal data |
| DUMMY | k-1 | Linear models with intercept (avoids the dummy variable trap) |
| ONE_HOT | k | Neural networks, regularized models without intercept |
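The column counts above determine the width of the converted array. The following is a minimal standalone sketch of that arithmetic, not Smile code: the class `EncodedWidth`, its local `Encoder` enum, and both helper methods are hypothetical names introduced only for illustration.

```java
public class EncodedWidth {
    /** Local stand-in for the three strategies; this is not Smile's CategoricalEncoder. */
    enum Encoder { LEVEL, DUMMY, ONE_HOT }

    /** Number of output columns produced by one categorical variable with k levels. */
    static int columnsFor(Encoder encoder, int k) {
        switch (encoder) {
            case DUMMY:   return k - 1; // reference level gets no column
            case ONE_HOT: return k;     // one column per level
            default:      return 1;     // LEVEL: a single integer-coded column
        }
    }

    /** Total width: optional bias column + numeric columns + encoded categorical columns. */
    static int totalWidth(boolean bias, Encoder encoder, int numericColumns, int... levelCounts) {
        int width = (bias ? 1 : 0) + numericColumns;
        for (int k : levelCounts) {
            width += columnsFor(encoder, k);
        }
        return width;
    }

    public static void main(String[] args) {
        // Iris layout: 4 numeric features plus species with 3 levels
        System.out.println(totalWidth(true, Encoder.DUMMY, 4, 3));    // 7
        System.out.println(totalWidth(false, Encoder.ONE_HOT, 4, 3)); // 7
        System.out.println(totalWidth(false, Encoder.LEVEL, 4, 3));   // 5
    }
}
```

Checking the expected width this way before fitting a model is a cheap guard against passing a design matrix with an unexpected number of columns.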
Inputs and Outputs
| Parameter | Type | Description | Default |
|---|---|---|---|
| columns / names | String... | Column names to include (empty = all columns) | All columns |
| bias | boolean | If true, prepend a column of 1.0 values | false |
| encoder | CategoricalEncoder | Encoding strategy for categorical variables | LEVEL |
| rowNames | String | Column to use as row names in the matrix (excluded from data) | null |
| Returns | double[][] or DenseMatrix | The numerical representation | -- |
Usage Examples
Example 1: Simple conversion to double array
import smile.io.Read;
import smile.data.DataFrame;
DataFrame iris = Read.csv("data/iris.csv",
"delimiter=,,header=true");
// Select only numeric feature columns
DataFrame features = iris.select("sepal_length", "sepal_width",
"petal_length", "petal_width");
// Convert all columns to a 2D array
double[][] X = features.toArray();
// X.length == 150 (rows), X[0].length == 4 (columns)
System.out.printf("Shape: %d x %d%n", X.length, X[0].length);
Example 2: Conversion with specific column selection
import smile.io.Read;
import smile.data.DataFrame;
DataFrame data = Read.csv("data/iris.csv",
"delimiter=,,header=true");
// Convert only two columns -- no need to select first
double[][] X = data.toArray("sepal_length", "petal_length");
// X[0].length == 2
System.out.printf("Selected features: %d x %d%n",
X.length, X[0].length);
Example 3: Dummy encoding with bias term for linear regression
import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
DataFrame data = Read.csv("data/iris.csv",
"delimiter=,,header=true");
// Factorize the species column first (string -> integer + NominalScale)
DataFrame encoded = data.factorize("species");
// Convert with intercept column and dummy encoding
// species has 3 levels -> 2 dummy columns
double[][] X = encoded.toArray(true, CategoricalEncoder.DUMMY,
"sepal_length", "sepal_width", "petal_length",
"petal_width", "species");
// X[0].length == 1 (bias) + 4 (numeric) + 2 (dummy) = 7
System.out.printf("With bias and dummy: %d x %d%n",
X.length, X[0].length);
// First column is all 1.0 (intercept)
System.out.println("Bias column: " + X[0][0]); // 1.0
Example 4: One-hot encoding for neural networks
import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
DataFrame data = Read.csv("data/iris.csv",
"delimiter=,,header=true");
DataFrame encoded = data.factorize("species");
// One-hot encoding: species (3 levels) -> 3 binary columns
double[][] X = encoded.toArray(false, CategoricalEncoder.ONE_HOT,
"sepal_length", "sepal_width", "petal_length",
"petal_width", "species");
// X[0].length == 4 (numeric) + 3 (one-hot) = 7
System.out.printf("With one-hot: %d x %d%n",
X.length, X[0].length);
Example 5: Conversion to DenseMatrix
import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.tensor.DenseMatrix;
DataFrame data = Read.csv("data/iris.csv",
"delimiter=,,header=true");
// Convert all numeric columns to a DenseMatrix
DataFrame numeric = data.select("sepal_length", "sepal_width",
"petal_length", "petal_width");
DenseMatrix matrix = numeric.toMatrix();
System.out.printf("Matrix: %d x %d%n",
matrix.nrow(), matrix.ncol());
// Matrix supports BLAS operations, named rows/columns, etc.
Example 6: DenseMatrix with row names and encoding
import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.tensor.DenseMatrix;
DataFrame data = Read.csv("data/countries.csv",
"delimiter=,,header=true");
DataFrame encoded = data.factorize("region");
// Use "country_name" column as row labels (excluded from data)
DenseMatrix matrix = encoded.toMatrix(true,
CategoricalEncoder.DUMMY, "country_name");
// Print the column names carried by the matrix
System.out.println("Column names: " +
String.join(", ", matrix.colNames()));
Example 7: Full pipeline -- load, inspect, select, transform, convert
import smile.io.Read;
import smile.data.DataFrame;
import smile.data.CategoricalEncoder;
import smile.feature.transform.Standardizer;
import smile.data.transform.InvertibleColumnTransform;
// Stage 1: Load
DataFrame raw = Read.csv("data/iris.csv",
"delimiter=,,header=true");
// Stage 2: Inspect
System.out.println("Schema: " + raw.schema());
System.out.println("Shape: " + raw.nrow() + " x " + raw.ncol());
// Stage 3: Select and factorize
DataFrame prepared = raw.factorize("species");
// Stage 4: Transform (standardize numeric features)
String[] featureCols = {"sepal_length", "sepal_width",
"petal_length", "petal_width"};
InvertibleColumnTransform std = Standardizer.fit(prepared, featureCols);
DataFrame transformed = std.apply(prepared);
// Stage 5: Convert to numerical array
double[][] X = transformed.toArray(false,
CategoricalEncoder.LEVEL, featureCols);
int[] y = transformed.column("species").intStream().toArray();
System.out.printf("X: %d x %d, y: %d%n",
X.length, X[0].length, y.length);
// Ready for classification: RandomForest.fit(X, y, ...)
Implementation Details
toArray() Encoding Logic
The toArray() method iterates over the requested columns. For each column:
- If the column has a CategoricalMeasure and the encoder is not LEVEL:
  - DUMMY: Creates one column per level except the reference level. For each row, sets matrix[i][j + k - 1] = 1.0, where k is the factor index (skipping the reference level 0).
  - ONE_HOT: Creates one column per level. For each row, sets matrix[i][j + k] = 1.0, where k is the factor index.
- Otherwise: Copies the double value directly with column.getDouble(i).
Missing values are naturally represented as Double.NaN since getDouble() returns NaN for null entries.
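The per-column logic described above can be sketched without Smile as follows. This is an illustrative reimplementation under simplifying assumptions (exactly one numeric column and one categorical column, already reduced to factor indices), not the library's source; `EncodingSketch`, its `Encoder` enum, and `encode` are hypothetical names.

```java
import java.util.Arrays;

public class EncodingSketch {
    /** Local stand-in for the three strategies; this is not Smile's CategoricalEncoder. */
    enum Encoder { LEVEL, DUMMY, ONE_HOT }

    /**
     * Encodes one numeric column and one categorical column into a row-major array.
     * numeric[i] may be null (missing); factor[i] is a factor index in [0, k).
     */
    static double[][] encode(boolean bias, Encoder encoder,
                             Double[] numeric, int[] factor, int k) {
        int n = numeric.length;
        int catWidth = encoder == Encoder.LEVEL ? 1
                     : encoder == Encoder.DUMMY ? k - 1 : k;
        int width = (bias ? 1 : 0) + 1 + catWidth;
        double[][] out = new double[n][width]; // zero-initialized by Java

        for (int i = 0; i < n; i++) {
            int j = 0;
            if (bias) out[i][j++] = 1.0;                 // prepended intercept column
            out[i][j++] = numeric[i] == null ? Double.NaN : numeric[i]; // missing -> NaN
            int f = factor[i];
            switch (encoder) {
                case LEVEL:
                    out[i][j] = f;                        // integer level as a double
                    break;
                case DUMMY:
                    if (f > 0) out[i][j + f - 1] = 1.0;   // level 0 is the reference
                    break;
                case ONE_HOT:
                    out[i][j + f] = 1.0;                  // one indicator per level
                    break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Double[] x = {5.1, null, 6.3};
        int[] species = {0, 1, 2};
        for (double[] row : encode(true, Encoder.DUMMY, x, species, 3)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In the dummy case the row for the reference level contains only zeros in the indicator columns, which is why this encoding pairs naturally with an intercept column.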
toMatrix() vs toArray()
toMatrix() produces a DenseMatrix (backed by Float64 scalar type via DenseMatrix.zeros(Float64, nrow, ncol)) with named columns and optional named rows. The encoding logic is identical to toArray(). The rowNames parameter specifies a column whose string values become row labels; that column is excluded from the numerical data.
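The row name behavior can be pictured as splitting one label column off a table before the numeric block is built. The sketch below is a standalone illustration, not Smile's implementation: the class `RowNameSplit`, the `Labeled` holder, and the tiny synthetic table in `main` are all assumptions made for the example.

```java
import java.util.Arrays;

public class RowNameSplit {
    /** Row labels plus the numeric block left after removing the label column. */
    static final class Labeled {
        final String[] rowNames;
        final String[] colNames;
        final double[][] data;

        Labeled(String[] rowNames, String[] colNames, double[][] data) {
            this.rowNames = rowNames;
            this.colNames = colNames;
            this.data = data;
        }
    }

    /** Uses rowNameColumn as labels and converts every other column to double. */
    static Labeled split(String[] header, Object[][] rows, String rowNameColumn) {
        int labelIdx = Arrays.asList(header).indexOf(rowNameColumn);
        if (labelIdx < 0) {
            throw new IllegalArgumentException("Unknown column: " + rowNameColumn);
        }
        String[] colNames = new String[header.length - 1];
        for (int j = 0, c = 0; j < header.length; j++) {
            if (j != labelIdx) colNames[c++] = header[j];
        }
        String[] rowNames = new String[rows.length];
        double[][] data = new double[rows.length][header.length - 1];
        for (int i = 0; i < rows.length; i++) {
            rowNames[i] = String.valueOf(rows[i][labelIdx]);
            for (int j = 0, c = 0; j < header.length; j++) {
                if (j != labelIdx) data[i][c++] = ((Number) rows[i][j]).doubleValue();
            }
        }
        return new Labeled(rowNames, colNames, data);
    }

    public static void main(String[] args) {
        // Synthetic placeholder values, for illustration only
        String[] header = {"name", "x1", "x2"};
        Object[][] rows = {{"A", 1.5, 10}, {"B", 2.5, 20}};
        Labeled m = split(header, rows, "name");
        System.out.println(Arrays.toString(m.rowNames));
        System.out.println(Arrays.toString(m.colNames));
        System.out.println(Arrays.deepToString(m.data));
    }
}
```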
Factor Mapping
Categorical encoding uses the CategoricalMeasure.factor(int value) method to map the stored integer value to its factor index. This accounts for potential gaps in the level encoding (e.g., if levels were merged or filtered).
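To see why a mapping step matters when stored values have gaps, consider the minimal sketch below. How CategoricalMeasure.factor(int) is implemented internally is not stated here, so the binary-search lookup is purely an assumption for illustration, and `FactorIndexSketch` and its `factor` helper are hypothetical names.

```java
import java.util.Arrays;

public class FactorIndexSketch {
    /**
     * Maps a stored level value to its contiguous factor index, i.e. its
     * position in the sorted array of known level values.
     */
    static int factor(int[] sortedLevels, int value) {
        int idx = Arrays.binarySearch(sortedLevels, value);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown level value: " + value);
        }
        return idx;
    }

    public static void main(String[] args) {
        // Stored values 0, 2, 5 (gaps at 1, 3, 4), e.g. after levels were filtered out
        int[] levels = {0, 2, 5};
        System.out.println(factor(levels, 0)); // 0
        System.out.println(factor(levels, 2)); // 1
        System.out.println(factor(levels, 5)); // 2
    }
}
```

Using the contiguous index rather than the raw stored value keeps the indicator columns dense: three levels always occupy three one-hot columns, regardless of the integers that represent them.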
Metadata
| Property | Value |
|---|---|
| Type | API Doc |
| Language | Java |
| Library Version | 5.2.0 |
| Last Updated | 2026-02-08 22:00 GMT |