Principle:Haifengl Smile Numerical Conversion
Overview
Numerical Conversion is the principle of converting a heterogeneous tabular DataFrame into homogeneous numerical arrays (double[][]) or matrices (DenseMatrix) that can be consumed directly by machine learning algorithms. This is the final stage of the data loading pipeline, bridging the gap between the data management world (typed columns, mixed types, categorical variables) and the mathematical computation world (real-valued vectors and matrices).
The conversion must handle:
- Numeric columns -- Directly copied to the output array.
- Categorical columns -- Encoded as numerical values using one of several encoding strategies (level encoding, dummy encoding, or one-hot encoding).
- Missing values -- Represented as Double.NaN in the output.
- Bias/intercept term -- Optionally prepended as a column of ones for linear models.
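The missing-value rule above can be illustrated with a minimal plain-Java sketch. It does not call Smile; the class name MissingValueSketch and the method toDoubles are hypothetical, and it assumes a missing entry is modelled as a null boxed value.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Smile: shows the "missing -> Double.NaN" rule.
final class MissingValueSketch {

    /** Copies boxed numeric values into a primitive array, mapping null to Double.NaN. */
    static double[] toDoubles(Number[] column) {
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            // Assumption: a missing observation is represented by null.
            out[i] = (column[i] == null) ? Double.NaN : column[i].doubleValue();
        }
        return out;
    }

    public static void main(String[] args) {
        Number[] age = {31, null, 47.5};
        System.out.println(Arrays.toString(toDoubles(age))); // prints [31.0, NaN, 47.5]
    }
}
```

Because NaN propagates through arithmetic, downstream algorithms usually need imputation or row filtering before training on such an array.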
Theoretical Basis
The Numerical Imperative
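Machine learning algorithms operate on elements of $\mathbb{R}^d$ -- real-valued vectors in $d$-dimensional space. A dataset of $n$ observations with $d$ features is represented as a design matrix:
$$X = \begin{bmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nd} \end{bmatrix} \in \mathbb{R}^{n \times d}$$
where $x_{ij}$ is the value of feature $j$ for observation $i$. This matrix is the universal input format for: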
- Linear models: $y = w^\top x + b$
- Distance-based methods: $d(x_i, x_j) = \lVert x_i - x_j \rVert_2$
- Gradient descent: $w \leftarrow w - \eta \nabla_w L(w)$
- Matrix factorizations: SVD, PCA, NMF
Categorical Encoding
Categorical variables with $k$ levels cannot be used directly as real numbers. Three encoding strategies are supported:
Level Encoding
The simplest approach assigns the integer index of each level $c_0, \ldots, c_{k-1}$ directly:
$$\phi_{\text{level}}(c_i) = i, \quad i = 0, \ldots, k-1$$
This produces a single column but implies an ordinal relationship that may not exist (e.g., "red" < "green" < "blue" is meaningless).
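As a rough illustration of level encoding, the following plain-Java sketch maps each value to the zero-based position of its level. It is not Smile's implementation; the class LevelEncodingSketch is hypothetical, and mapping null or unknown values to Double.NaN is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Smile: level (integer index) encoding.
final class LevelEncodingSketch {

    /** Encodes each value as the zero-based index of its level, or NaN if null/unknown. */
    static double[] encode(String[] values, String[] levels) {
        List<String> index = Arrays.asList(levels);
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            int code = (values[i] == null) ? -1 : index.indexOf(values[i]);
            // Assumption: anything without a known level becomes a missing value.
            out[i] = (code >= 0) ? code : Double.NaN;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] levels = {"red", "green", "blue"};
        String[] colour = {"green", "red", "blue", null};
        System.out.println(Arrays.toString(encode(colour, levels))); // prints [1.0, 0.0, 2.0, NaN]
    }
}
```

The output occupies one column regardless of the number of levels, which is why the implied ordering matters for models that treat the value as a magnitude.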
Dummy Encoding (Reference Coding)
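Creates $k-1$ binary indicator columns, with the first level $c_0$ as the reference:
$$\phi_{\text{dummy}}(c) = \big(\mathbb{1}[c = c_1], \ldots, \mathbb{1}[c = c_{k-1}]\big) \in \{0, 1\}^{k-1}$$
The reference level is encoded as all zeros. This avoids the dummy variable trap (multicollinearity with an intercept term) because the $k-1$ indicator columns have rank $k-1$ and stay linearly independent of the column of ones.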
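A minimal plain-Java sketch of dummy encoding follows, assuming the first entry of the levels array is the reference. The class DummyEncodingSketch is hypothetical and is not taken from Smile; rejecting unknown values with an exception is a choice made for this example so that they are not confused with the all-zero reference row.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Smile: dummy (reference) encoding.
final class DummyEncodingSketch {

    /** Returns k-1 indicators; the reference level levels[0] yields all zeros. */
    static double[] encode(String value, String[] levels) {
        int index = Arrays.asList(levels).indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown level: " + value);
        }
        double[] row = new double[levels.length - 1];
        if (index > 0) {
            row[index - 1] = 1.0; // level i (i >= 1) sets indicator i-1
        }
        return row;
    }

    public static void main(String[] args) {
        String[] levels = {"red", "green", "blue"};
        System.out.println(Arrays.toString(encode("red", levels)));   // prints [0.0, 0.0]
        System.out.println(Arrays.toString(encode("green", levels))); // prints [1.0, 0.0]
        System.out.println(Arrays.toString(encode("blue", levels)));  // prints [0.0, 1.0]
    }
}
```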
One-Hot Encoding
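Creates $k$ binary indicator columns, one per level $c_0, \ldots, c_{k-1}$:
$$\phi_{\text{one-hot}}(c) = \big(\mathbb{1}[c = c_0], \ldots, \mathbb{1}[c = c_{k-1}]\big) \in \{0, 1\}^{k}$$
This is full-rank with respect to the categorical variable but creates multicollinearity with an intercept term, because the $k$ indicators sum to one in every row.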
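The sketch below shows one-hot encoding in plain Java and checks the property that causes multicollinearity with an intercept: every encoded row sums to one. The class OneHotEncodingSketch and its methods are hypothetical illustrations, not Smile code.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Smile: one-hot encoding.
final class OneHotEncodingSketch {

    /** Returns k indicators with exactly one entry set to 1.0. */
    static double[] encode(String value, String[] levels) {
        int index = Arrays.asList(levels).indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown level: " + value);
        }
        double[] row = new double[levels.length];
        row[index] = 1.0;
        return row;
    }

    /** Sum of the indicators; always 1.0, i.e. identical to a bias column entry. */
    static double rowSum(double[] row) {
        double sum = 0.0;
        for (double v : row) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] levels = {"red", "green", "blue"};
        double[] green = encode("green", levels);
        System.out.println(Arrays.toString(green)); // prints [0.0, 1.0, 0.0]
        System.out.println(rowSum(green));          // prints 1.0
    }
}
```

Since the indicator columns add up to the column of ones, combining one-hot encoding with a bias column makes the design matrix rank-deficient for unregularized linear models.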
Bias Term
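For linear models $y = w^\top x + b$, the intercept $b$ is handled by prepending a column of ones:
$$\tilde{x} = (1, x_1, \ldots, x_d)^\top$$
so that $y = \tilde{w}^\top \tilde{x}$ where $\tilde{w} = (b, w_1, \ldots, w_d)^\top$.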
In Smile, setting bias=true in toArray() or toMatrix() adds this column.
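To make the bias trick concrete, here is a small plain-Java sketch that prepends a column of ones and checks that the augmented dot product equals the original prediction plus intercept. It does not use Smile; the class BiasColumnSketch and the example weights are hypothetical.

```java
// Hypothetical helper, not part of Smile: prepending a bias column by hand.
final class BiasColumnSketch {

    /** Returns a copy of x with a leading column of ones. */
    static double[][] prependOnes(double[][] x) {
        double[][] out = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            out[i] = new double[x[i].length + 1];
            out[i][0] = 1.0;
            System.arraycopy(x[i], 0, out[i], 1, x[i].length);
        }
        return out;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] x = {{2.0, 3.0}};
        double[] w = {0.5, -1.0};
        double b = 4.0;
        double[] wTilde = {b, 0.5, -1.0};   // intercept folded into the weights

        double plain = dot(w, x[0]) + b;
        double augmented = dot(wTilde, prependOnes(x)[0]);
        System.out.println(plain + " " + augmented); // prints 2.0 2.0
    }
}
```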
Output Dimensionality
The total number of output columns depends on the encoding:
| Encoding | Columns per Numeric Feature | Columns per $k$-level Categorical | Bias |
|---|---|---|---|
| Level | 1 | 1 | +1 if enabled |
| Dummy | 1 | $k-1$ | +1 if enabled |
| One-Hot | 1 | $k$ | +1 if enabled |
Total output columns: $b + \sum_{j} c_j$, where $b \in \{0, 1\}$ is the bias and $c_j$ is the column count for feature $j$.
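The column-count rule can be checked with a short plain-Java sketch. The class OutputWidthSketch, its Encoding enum, and the convention that a level count of 0 marks a numeric column are all assumptions made for this illustration; they are not Smile types.

```java
// Hypothetical helper, not part of Smile: computes the converted matrix width.
final class OutputWidthSketch {

    enum Encoding { LEVEL, DUMMY, ONE_HOT }

    /**
     * levelsPerColumn[j] is 0 for a numeric column, otherwise the number of
     * levels k of categorical column j (a convention chosen for this sketch).
     */
    static int outputColumns(int[] levelsPerColumn, Encoding encoding, boolean bias) {
        int total = bias ? 1 : 0;
        for (int k : levelsPerColumn) {
            if (k == 0) {
                total += 1;               // numeric: copied as one column
            } else {
                switch (encoding) {
                    case LEVEL:   total += 1;     break;
                    case DUMMY:   total += k - 1; break;
                    case ONE_HOT: total += k;     break;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Two numeric columns, one 3-level and one 4-level categorical column.
        int[] levels = {0, 0, 3, 4};
        System.out.println(outputColumns(levels, Encoding.LEVEL, false));   // prints 4
        System.out.println(outputColumns(levels, Encoding.DUMMY, true));    // prints 8
        System.out.println(outputColumns(levels, Encoding.ONE_HOT, false)); // prints 9
    }
}
```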
Design Considerations
Array vs Matrix
Smile provides two output formats:
- double[][] -- A Java 2D array. Efficient for row-oriented access. Used by most classification and regression algorithms.
- DenseMatrix -- A Smile tensor/matrix object. Supports named rows and columns, BLAS/LAPACK operations, and is optimized for linear algebra. Used by matrix decomposition algorithms (SVD, PCA).
Column Selection
Both toArray() and toMatrix() accept optional column name parameters. If omitted, all columns are converted. This allows a subset of columns to be converted without a prior select() call.
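The "all columns unless names are given" behaviour can be mimicked in plain Java as follows. This is only a sketch of the idea, not Smile's implementation: the class ColumnSelectionSketch, the use of a Map of column arrays as a stand-in for a DataFrame, and the exception on unknown names are all assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a DataFrame: column name -> numeric column values.
final class ColumnSelectionSketch {

    /** Converts the named columns (or all columns if none are named) to a row-major array. */
    static double[][] toArray(Map<String, double[]> table, String... names) {
        String[] selected = (names.length == 0)
                ? table.keySet().toArray(new String[0])
                : names;
        int rows = table.isEmpty() ? 0 : table.values().iterator().next().length;
        double[][] out = new double[rows][selected.length];
        for (int j = 0; j < selected.length; j++) {
            double[] column = table.get(selected[j]);
            if (column == null) {
                throw new IllegalArgumentException("Unknown column: " + selected[j]);
            }
            for (int i = 0; i < rows; i++) {
                out[i][j] = column[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, double[]> table = new LinkedHashMap<>(); // keeps insertion order
        table.put("age", new double[] {31, 47});
        table.put("income", new double[] {52000, 61000});
        table.put("height", new double[] {1.7, 1.8});

        System.out.println(Arrays.deepToString(toArray(table)));
        System.out.println(Arrays.deepToString(toArray(table, "height", "age")));
    }
}
```

In this sketch the output columns follow the order of the requested names, so the caller controls the feature layout of the resulting array.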
Relationship to the Data Loading Pipeline
Numerical Conversion is the fifth and final stage of the Smile Data Loading Pipeline:
- File Data Loading -- Read data from files.
- DataFrame Inspection -- Examine structure and metadata.
- Column Selection and Filtering -- Select relevant columns.
- Data Transformation -- Normalize and scale features.
- Numerical Conversion -- Convert to arrays/matrices for ML algorithms. (current)
The output of this stage is the direct input to Smile's classification, regression, clustering, and dimensionality reduction algorithms.
Metadata
| Property | Value |
|---|---|
| Domains | Data_Engineering, ETL |
| Workflow | Data_Loading_Pipeline |
| Stage | 5 of 5 |
| Last Updated | 2026-02-08 22:00 GMT |