Principle:Haifengl Smile Numerical Conversion

Overview

Numerical Conversion is the principle of converting a heterogeneous tabular DataFrame into homogeneous numerical arrays (double[][]) or matrices (DenseMatrix) that can be consumed directly by machine learning algorithms. This is the final stage of the data loading pipeline, bridging the gap between the data management world (typed columns, mixed types, categorical variables) and the mathematical computation world (real-valued vectors and matrices).

The conversion must handle:

Numeric columns -- Directly copied to the output array.
Categorical columns -- Encoded as numerical values using one of several encoding strategies (level encoding, dummy encoding, or one-hot encoding).
Missing values -- Represented as Double.NaN in the output.
Bias/intercept term -- Optionally prepended as a column of ones for linear models.
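
The missing-value rule can be illustrated with a minimal sketch in plain Java. It assumes a nullable numeric column is held as a boxed Double[]; the class and method names are hypothetical and this is not Smile's implementation.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Smile: illustrates the null -> NaN convention.
final class MissingValueSketch {
    /** Copies a nullable numeric column into a primitive array, mapping null to NaN. */
    static double[] toPrimitive(Double[] column) {
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            // A null entry stands for a missing value and becomes NaN.
            out[i] = (column[i] == null) ? Double.NaN : column[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] converted = toPrimitive(new Double[] {1.5, null, 3.0});
        System.out.println(Arrays.toString(converted)); // prints [1.5, NaN, 3.0]
    }
}
```

Because NaN never compares equal to itself, downstream code should test for missing entries with Double.isNaN(value) rather than value == Double.NaN.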

Theoretical Basis

The Numerical Imperative

Machine learning algorithms operate on elements of $ℝ^{p}$ -- real-valued vectors in $p$ -dimensional space. A dataset of $n$ observations with $p$ features is represented as a design matrix:

$X \in ℝ^{n \times p}$

where $X_{i j}$ is the value of feature $j$ for observation $i$ . This matrix is the universal input format for:

Linear models: $\hat{y} = X β$
Distance-based methods: $d (x_{i}, x_{j}) = ‖ x_{i} - x_{j} ‖$
Gradient descent: for the least-squares loss $L (β) = \frac{1}{2} ‖ X β - y ‖^{2}$ , $\nabla_{β} L = X^{T} (X β - y)$
Matrix factorizations: SVD, PCA, NMF
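
To make the role of the design matrix concrete, the sketch below evaluates the gradient expression listed above on a row-major double[][]. It assumes the loss is the least-squares loss $\frac{1}{2} ‖ X β - y ‖^{2}$ ; the class and method names are hypothetical and independent of Smile.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Smile: computes X^T (X beta - y).
final class LeastSquaresGradientSketch {
    /** Returns X^T (X beta - y) for a row-major design matrix X with n rows and p columns. */
    static double[] gradient(double[][] x, double[] beta, double[] y) {
        int p = beta.length;
        double[] grad = new double[p];
        for (int i = 0; i < x.length; i++) {
            // Residual of observation i: (X beta)_i - y_i.
            double residual = -y[i];
            for (int j = 0; j < p; j++) {
                residual += x[i][j] * beta[j];
            }
            // Add row i of X, scaled by its residual, to the running sum X^T r.
            for (int j = 0; j < p; j++) {
                grad[j] += x[i][j] * residual;
            }
        }
        return grad;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 2}, {3, 4}};
        double[] grad = gradient(x, new double[] {1, 1}, new double[] {2, 7});
        System.out.println(Arrays.toString(grad)); // prints [1.0, 2.0]
    }
}
```

In the example, the residuals are (1, 0), so the gradient equals the first row of $X$ .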

Categorical Encoding

Categorical variables with $k$ levels $\{c_{1}, c_{2}, \dots, c_{k}\}$ cannot be directly used as real numbers. Three encoding strategies are supported:

Level Encoding

The simplest approach assigns the integer index directly:

$\mathrm{encode}(c_{j}) = j - 1, \quad j \in \{1, 2, \dots, k\}$ , so the codes range over $\{0, 1, \dots, k - 1\}$ .

This produces a single column but implies an ordinal relationship that may not exist (e.g., "red" < "green" < "blue" is meaningless).
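
A minimal sketch of level encoding in plain Java follows. It assumes the level list is given explicitly and that codes are zero-based positions in that list; the names are hypothetical and this is not Smile's encoder.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Smile: maps each value to its level index.
final class LevelEncodingSketch {
    /** Encodes each value as the zero-based position of its level in {@code levels}. */
    static double[] encode(String[] values, String[] levels) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < levels.length; i++) {
            index.put(levels[i], i);
        }
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer code = index.get(values[i]);
            if (code == null) {
                throw new IllegalArgumentException("Unknown level: " + values[i]);
            }
            out[i] = code;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] levels = {"red", "green", "blue"};
        String[] values = {"green", "red", "blue", "red"};
        System.out.println(Arrays.toString(encode(values, levels))); // prints [1.0, 0.0, 2.0, 0.0]
    }
}
```

The output shows the ordering problem directly: "blue" (2.0) ends up numerically greater than "green" (1.0) purely because of its position in the level list.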

Dummy Encoding (Reference Coding)

Creates $k - 1$ binary indicator columns, with the first level as the reference:

$\mathrm{encode}(c_{j}) = (d_{1}, d_{2}, \dots, d_{k - 1})$ where $d_{i} = \begin{cases} 1 & \text{if } j = i + 1 \\ 0 & \text{otherwise} \end{cases}$

The reference level $c_{1}$ is encoded as all zeros. This avoids the dummy variable trap (perfect multicollinearity with an intercept term): with the reference level dropped, the $k - 1$ indicator columns no longer sum to the column of ones.
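
As a minimal sketch, dummy encoding of a single value can be written as below. It assumes levels are identified by a zero-based index with index 0 as the reference level; the names are hypothetical and not taken from Smile.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Smile: reference (dummy) coding of one value.
final class DummyEncodingSketch {
    /**
     * Encodes a zero-based level index into k - 1 indicator values.
     * Level 0 is the reference level and maps to all zeros.
     */
    static double[] encode(int level, int k) {
        if (level < 0 || level >= k) {
            throw new IllegalArgumentException("level must be in [0, " + k + ")");
        }
        double[] out = new double[k - 1];
        if (level > 0) {
            // Non-reference level: set the indicator for that level.
            out[level - 1] = 1.0;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode(0, 3))); // prints [0.0, 0.0]
        System.out.println(Arrays.toString(encode(2, 3))); // prints [0.0, 1.0]
    }
}
```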

One-Hot Encoding

Creates $k$ binary indicator columns:

$\mathrm{encode}(c_{j}) = (h_{1}, h_{2}, \dots, h_{k})$ where $h_{i} = \begin{cases} 1 & \text{if } j = i \\ 0 & \text{otherwise} \end{cases}$

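This is full-rank with respect to the categorical variable, but every row contains exactly one 1, so the $k$ indicator columns always sum to the column of ones. Combining them with an intercept term therefore creates perfect multicollinearity.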
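
The following sketch shows one-hot encoding of a single value and checks the row-sum property that makes it collinear with an intercept column. It assumes zero-based level indices; the names are hypothetical and not taken from Smile.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Smile: one-hot coding of one value.
final class OneHotEncodingSketch {
    /** Encodes a zero-based level index into k indicator values with a single 1. */
    static double[] encode(int level, int k) {
        if (level < 0 || level >= k) {
            throw new IllegalArgumentException("level must be in [0, " + k + ")");
        }
        double[] out = new double[k];
        out[level] = 1.0;
        return out;
    }

    /** Sums the entries of one encoded row. */
    static double rowSum(double[] row) {
        double sum = 0.0;
        for (double v : row) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] row = encode(1, 3);
        System.out.println(Arrays.toString(row)); // prints [0.0, 1.0, 0.0]
        System.out.println(rowSum(row));          // prints 1.0
    }
}
```

Since every encoded row sums to 1, the sum of the indicator columns reproduces a column of ones, which is exactly the intercept column.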

Bias Term

For linear models $y = β_{0} + β_{1} x_{1} + \dots + β_{p} x_{p}$ , the intercept $β_{0}$ is handled by prepending a column of ones:

$\tilde{X} = \begin{bmatrix} 1 & x_{11} & \dots & x_{1 p} \\ 1 & x_{21} & \dots & x_{2 p} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 1 & x_{n 1} & \dots & x_{n p} \end{bmatrix}$

so that $\hat{y} = \tilde{X} \tilde{β}$ where $\tilde{β} = (β_{0}, β_{1}, \dots, β_{p})^{T}$ .

In Smile, setting bias=true in toArray() or toMatrix() adds this column.
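
What prepending the bias column does to a row-major array can be sketched in plain Java as follows. This is an illustration under the assumption of a rectangular double[][] input, with hypothetical names; it is not the code Smile runs when bias=true.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Smile: prepends an intercept column of ones.
final class BiasColumnSketch {
    /** Returns a copy of x in which every row starts with 1.0. */
    static double[][] withBias(double[][] x) {
        double[][] out = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            out[i] = new double[x[i].length + 1];
            out[i][0] = 1.0; // intercept entry
            // Shift the original features one position to the right.
            System.arraycopy(x[i], 0, out[i], 1, x[i].length);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] x = {{2, 3}, {4, 5}};
        System.out.println(Arrays.deepToString(withBias(x))); // prints [[1.0, 2.0, 3.0], [1.0, 4.0, 5.0]]
    }
}
```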

Output Dimensionality

The total number of output columns depends on the encoding:

Encoding	Columns per Numeric Feature	Columns per $k$ -level Categorical	Bias
Level	1	1	+1 if enabled
Dummy	1	$k - 1$	+1 if enabled
One-Hot	1	$k$	+1 if enabled

Total output columns: $p' = b + \sum_{j = 1}^{p} m_{j}$ where $b \in \{0, 1\}$ indicates whether the bias column is included and $m_{j}$ is the number of output columns produced by feature $j$ .
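
As a worked example of the column-count formula, the sketch below computes the output width for a frame with a given number of numeric features and a list of categorical level counts. The enum and method names are hypothetical and do not correspond to Smile's API.

```java
// Hypothetical helper, not part of Smile: counts output columns per the table above.
final class OutputWidthSketch {
    enum Encoding { LEVEL, DUMMY, ONE_HOT }

    /**
     * @param numericFeatures   number of numeric columns (one output column each)
     * @param categoricalLevels number of levels k for each categorical column
     * @param encoding          categorical encoding strategy
     * @param bias              whether a column of ones is prepended
     */
    static int outputColumns(int numericFeatures, int[] categoricalLevels,
                             Encoding encoding, boolean bias) {
        int total = (bias ? 1 : 0) + numericFeatures;
        for (int k : categoricalLevels) {
            switch (encoding) {
                case LEVEL:
                    total += 1;
                    break;
                case DUMMY:
                    total += k - 1;
                    break;
                case ONE_HOT:
                    total += k;
                    break;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] levels = {4, 2}; // two categorical columns with 4 and 2 levels
        System.out.println(outputColumns(3, levels, Encoding.LEVEL, false));   // prints 5
        System.out.println(outputColumns(3, levels, Encoding.DUMMY, true));    // prints 8
        System.out.println(outputColumns(3, levels, Encoding.ONE_HOT, false)); // prints 9
    }
}
```

With three numeric features and categorical columns of 4 and 2 levels, dummy encoding with bias gives 1 + 3 + (4 - 1) + (2 - 1) = 8 columns.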

Design Considerations

Array vs Matrix

Smile provides two output formats:

double[][] -- A Java 2D array. Efficient for row-oriented access. Used by most classification and regression algorithms.
DenseMatrix -- A Smile tensor/matrix object. Supports named rows and columns, BLAS/LAPACK operations, and is optimized for linear algebra. Used by matrix decomposition algorithms (SVD, PCA).

Column Selection

Both toArray() and toMatrix() accept optional column name parameters. If omitted, all columns are converted. This allows selective conversion of a subset without prior select().

Relationship to the Data Loading Pipeline

Numerical Conversion is the fifth and final stage of the Smile Data Loading Pipeline:

File Data Loading -- Read data from files.
DataFrame Inspection -- Examine structure and metadata.
Column Selection and Filtering -- Select relevant columns.
Data Transformation -- Normalize and scale features.
Numerical Conversion -- Convert to arrays/matrices for ML algorithms. (current)

The output of this stage is the direct input to Smile's classification, regression, clustering, and dimensionality reduction algorithms.

Related Pages

Implementation:Haifengl_Smile_DataFrame_Numerical_Conversion

Knowledge Sources

Smile

Metadata

Property	Value
Domains	Data_Engineering, ETL
Workflow	Data_Loading_Pipeline
Stage	5 of 5
Last Updated	2026-02-08 22:00 GMT
