Principle:Haifengl Smile Numerical Conversion

From Leeroopedia


Overview

Numerical Conversion is the principle of converting a heterogeneous tabular DataFrame into homogeneous numerical arrays (double[][]) or matrices (DenseMatrix) that can be consumed directly by machine learning algorithms. This is the final stage of the data loading pipeline, bridging the gap between the data management world (typed columns, mixed types, categorical variables) and the mathematical computation world (real-valued vectors and matrices).

The conversion must handle:

  • Numeric columns -- Directly copied to the output array.
  • Categorical columns -- Encoded as numerical values using one of several encoding strategies (level encoding, dummy encoding, or one-hot encoding).
  • Missing values -- Represented as Double.NaN in the output.
  • Bias/intercept term -- Optionally prepended as a column of ones for linear models.
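The missing-value rule above can be sketched in plain Java. This is only an illustration, not Smile's implementation: it assumes a nullable numeric column is held as a boxed `Double[]` with `null` marking a missing entry, and the class and method names are hypothetical.

```java
// Hypothetical helper, not part of Smile: illustrates mapping missing values to NaN.
final class MissingValueSketch {
    /** Copies a nullable numeric column into a primitive array, mapping null to Double.NaN. */
    static double[] toPrimitive(Double[] column) {
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            // Assumption: a missing entry is represented by null in the boxed column.
            out[i] = (column[i] == null) ? Double.NaN : column[i];
        }
        return out;
    }
}
```

Because NaN never compares equal to itself, downstream code should test for missing entries with `Double.isNaN(value)` rather than `==`.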

Theoretical Basis

The Numerical Imperative

Machine learning algorithms operate on elements of $\mathbb{R}^p$ -- real-valued vectors in p-dimensional space. A dataset of n observations with p features is represented as a design matrix:

$$X \in \mathbb{R}^{n \times p}$$

where $X_{ij}$ is the value of feature j for observation i. This matrix is the universal input format for:

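  • Linear models: $\hat{y} = X\beta$
  • Distance-based methods: $d(x_i, x_j) = \lVert x_i - x_j \rVert$
  • Gradient descent: $\nabla_\beta L = X^T (X\beta - y)$ for the squared-error loss $L = \tfrac{1}{2} \lVert X\beta - y \rVert^2$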
  • Matrix factorizations: SVD, PCA, NMF
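To make the role of the design matrix concrete, the following sketch computes the linear prediction and the squared-error gradient from the list above on a plain `double[][]`. It is a minimal illustration with hypothetical class and method names, and it assumes a dense matrix whose rows all have `beta.length` entries.

```java
// Hypothetical helper, not part of Smile: plain-Java linear algebra on a design matrix.
final class DesignMatrixSketch {
    /** Computes the predictions X * beta, one value per row of X. */
    static double[] predict(double[][] x, double[] beta) {
        double[] yHat = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < beta.length; j++) {
                yHat[i] += x[i][j] * beta[j];
            }
        }
        return yHat;
    }

    /** Computes X^T (X beta - y), the gradient of the loss 0.5 * ||X beta - y||^2. */
    static double[] gradient(double[][] x, double[] y, double[] beta) {
        double[] yHat = predict(x, beta);
        double[] grad = new double[beta.length];
        for (int i = 0; i < x.length; i++) {
            double residual = yHat[i] - y[i];
            for (int j = 0; j < beta.length; j++) {
                grad[j] += x[i][j] * residual;
            }
        }
        return grad;
    }
}
```

Both routines only need row-by-row access to real values, which is why the conversion stage must remove every non-numeric column type first.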

Categorical Encoding

Categorical variables with k levels $\{c_1, c_2, \ldots, c_k\}$ cannot be directly used as real numbers. Three encoding strategies are supported:

Level Encoding

The simplest approach assigns the integer index directly:

$$\mathrm{encode}(c_j) = j - 1, \quad j \in \{1, 2, \ldots, k\}$$

so the codes take values in $\{0, 1, \ldots, k-1\}$.

This produces a single column but implies an ordinal relationship that may not exist (e.g., "red" < "green" < "blue" is meaningless).
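A minimal sketch of level encoding, assuming the levels are given as an ordered `String[]` and codes are zero-based positions in that array. The class name is hypothetical, and mapping unknown or null values to NaN is an assumption made here for illustration, not a documented Smile behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Smile: zero-based level encoding of a string column.
final class LevelEncodingSketch {
    /** Replaces each value by the zero-based position of its level. */
    static double[] encode(String[] levels, String[] values) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < levels.length; i++) {
            index.put(levels[i], i);
        }
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer code = index.get(values[i]);
            // Assumption: values outside the level set (including null) become NaN.
            out[i] = (code == null) ? Double.NaN : code;
        }
        return out;
    }
}
```

With levels `{"red", "green", "blue"}`, the values "green" and "blue" become 1 and 2, so a distance-based method would treat "blue" as farther from "red" than "green" is, which is exactly the spurious ordering described above.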

Dummy Encoding (Reference Coding)

Creates $k-1$ binary indicator columns, with the first level as the reference:

$$\mathrm{encode}(c_j) = (d_1, d_2, \ldots, d_{k-1}), \quad d_i = \begin{cases} 1 & \text{if } j = i + 1 \\ 0 & \text{otherwise} \end{cases}$$

The reference level $c_1$ is encoded as all zeros. This avoids the dummy variable trap (multicollinearity with an intercept term) because the encoding has rank $k-1$.
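The dummy-encoding rule can be sketched as follows. The code takes the zero-based level code (so the reference level is code 0) and the number of levels k; the class and method names are hypothetical and this is not Smile's implementation.

```java
// Hypothetical helper, not part of Smile: dummy (reference) coding of one categorical value.
final class DummyEncodingSketch {
    /** Returns k - 1 indicators; the reference level (code 0) maps to all zeros. */
    static double[] encode(int levelCode, int k) {
        if (levelCode < 0 || levelCode >= k) {
            throw new IllegalArgumentException("level code out of range: " + levelCode);
        }
        double[] out = new double[k - 1];
        if (levelCode > 0) {
            // Level code c (c >= 1) sets the indicator at position c - 1.
            out[levelCode - 1] = 1.0;
        }
        return out;
    }
}
```

For k = 3 the three levels map to (0, 0), (1, 0), and (0, 1), so the coefficient of each indicator in a linear model is read as an offset from the reference level.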

One-Hot Encoding

Creates k binary indicator columns:

$$\mathrm{encode}(c_j) = (h_1, h_2, \ldots, h_k), \quad h_i = \begin{cases} 1 & \text{if } j = i \\ 0 & \text{otherwise} \end{cases}$$

This is full-rank with respect to the categorical variable but creates multicollinearity with an intercept term, because the k indicator columns always sum to the column of ones.
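A matching sketch for one-hot encoding, again taking a zero-based level code and the number of levels k. The names are hypothetical and the code is illustrative only.

```java
// Hypothetical helper, not part of Smile: one-hot coding of one categorical value.
final class OneHotEncodingSketch {
    /** Returns k indicators with a single 1 at the position of the level code. */
    static double[] encode(int levelCode, int k) {
        if (levelCode < 0 || levelCode >= k) {
            throw new IllegalArgumentException("level code out of range: " + levelCode);
        }
        double[] out = new double[k];
        out[levelCode] = 1.0;
        return out;
    }
}
```

Every encoded row contains exactly one 1, so the indicators of each row sum to 1. That sum reproduces a bias column exactly, which is the source of the collinearity when one-hot encoding is combined with an intercept.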

Bias Term

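For linear models $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$, the intercept $\beta_0$ is handled by prepending a column of ones:

$$\tilde{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}$$

so that $\hat{y} = \tilde{X} \tilde{\beta}$ where $\tilde{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^T$.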

In Smile, setting bias=true in toArray() or toMatrix() adds this column.
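What prepending the bias column amounts to can be sketched on a plain `double[][]`. This is a stand-alone illustration with hypothetical names; it does not call Smile and is not a description of how Smile implements the option.

```java
// Hypothetical helper, not part of Smile: prepends a column of ones to a row-major array.
final class BiasColumnSketch {
    /** Returns a copy of x in which every row starts with the constant 1.0. */
    static double[][] withBias(double[][] x) {
        double[][] out = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            out[i] = new double[x[i].length + 1];
            out[i][0] = 1.0; // multiplies the intercept coefficient
            System.arraycopy(x[i], 0, out[i], 1, x[i].length);
        }
        return out;
    }
}
```

After this step the intercept is just the first coefficient, so a fitting routine can treat it like any other weight.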

Output Dimensionality

The total number of output columns depends on the encoding:

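| Encoding | Columns per Numeric Feature | Columns per k-level Categorical | Bias |
|----------|-----------------------------|---------------------------------|------|
| Level | 1 | 1 | +1 if enabled |
| Dummy | 1 | $k-1$ | +1 if enabled |
| One-Hot | 1 | $k$ | +1 if enabled |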

Total output columns: $p' = b + \sum_{j=1}^{p} c_j$, where $b \in \{0, 1\}$ indicates whether the bias column is present and $c_j$ is the column count for feature j.
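The column-count formula can be checked with a short sketch. Here each feature is described by its number of levels, with 0 standing for a numeric feature; that convention, the enum, and the class name are assumptions made for this example only.

```java
// Hypothetical helper, not part of Smile: counts output columns for a given encoding.
final class OutputColumnsSketch {
    enum Encoding { LEVEL, DUMMY, ONE_HOT }

    /**
     * @param bias     whether a column of ones is prepended
     * @param encoding categorical encoding applied to every categorical feature
     * @param levels   levels per feature; 0 marks a numeric feature (assumed convention)
     */
    static int count(boolean bias, Encoding encoding, int[] levels) {
        int total = bias ? 1 : 0;
        for (int k : levels) {
            if (k == 0) {
                total += 1; // numeric feature: copied as a single column
                continue;
            }
            switch (encoding) {
                case LEVEL:   total += 1;     break;
                case DUMMY:   total += k - 1; break;
                case ONE_HOT: total += k;     break;
            }
        }
        return total;
    }
}
```

For two numeric features plus categoricals with 3 and 4 levels, and the bias enabled, this gives 1 + 2 + 2 = 5 columns with level encoding, 1 + 2 + (2 + 3) = 8 with dummy encoding, and 1 + 2 + (3 + 4) = 10 with one-hot encoding.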

Design Considerations

Array vs Matrix

Smile provides two output formats:

  • double[][] -- A Java 2D array. Efficient for row-oriented access. Used by most classification and regression algorithms.
  • DenseMatrix -- A Smile tensor/matrix object. Supports named rows and columns, BLAS/LAPACK operations, and is optimized for linear algebra. Used by matrix decomposition algorithms (SVD, PCA).

Column Selection

Both toArray() and toMatrix() accept optional column name parameters. If omitted, all columns are converted. This allows selective conversion of a subset without prior select().

Relationship to the Data Loading Pipeline

Numerical Conversion is the fifth and final stage of the Smile Data Loading Pipeline:

  1. File Data Loading -- Read data from files.
  2. DataFrame Inspection -- Examine structure and metadata.
  3. Column Selection and Filtering -- Select relevant columns.
  4. Data Transformation -- Normalize and scale features.
  5. Numerical Conversion -- Convert to arrays/matrices for ML algorithms. (current)

The output of this stage is the direct input to Smile's classification, regression, clustering, and dimensionality reduction algorithms.

Metadata

| Property | Value |
|----------|-------|
| Domains | Data_Engineering, ETL |
| Workflow | Data_Loading_Pipeline |
| Stage | 5 of 5 |
| Last Updated | 2026-02-08 22:00 GMT |
