Principle: .NET Machine Learning Feature Engineering
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Feature Engineering, Data Preprocessing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Feature engineering transforms raw data columns of heterogeneous types into unified numeric feature vectors that machine learning algorithms can consume.
Description
Raw datasets contain columns of different types: categorical strings, free-form text, integers, floats, dates, and more. Most ML algorithms operate exclusively on dense or sparse numeric vectors. Feature engineering bridges this gap through a set of transforms that convert each column type into a numeric representation and then concatenate the results into a single feature vector.
The key transforms include:
- One-hot encoding for categorical data: maps each distinct category value to a binary indicator vector. A column with k unique values becomes a vector of length k with exactly one element set to 1.
- Text featurization for string data: applies tokenization, n-gram extraction, stop-word removal, and TF-IDF weighting to produce a numeric vector capturing the statistical properties of the text.
- Min-max normalization for numeric data: rescales numeric features to a [0, 1] range (or similar) so that features with large magnitudes do not dominate distance-based or gradient-based algorithms.
- Concatenation for assembling the final vector: combines multiple transformed columns into a single vector column (typically named "Features") that trainers consume.
Feature engineering is implemented as a pipeline of estimators. Each estimator declares its input and output columns, and the pipeline chains them together. When Fit is called, the pipeline learns any data-dependent parameters (e.g., the vocabulary for one-hot encoding, the min/max values for normalization). The resulting transformer can then be applied to new data.
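The estimator/transformer split described above can be sketched generically. This is a minimal illustrative Python sketch, not the ML.NET API; all class and method names here (`MinMaxEstimator`, `Pipeline`, `fit`, `transform`) are invented for illustration. The key point is that `fit` learns data-dependent parameters and returns a transformer that can be reapplied to new data.

```python
class MinMaxEstimator:
    """Estimator: declares its input column; learns min/max when fit."""
    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        values = [row[self.column] for row in rows]
        # The learned parameters live in the returned transformer.
        return MinMaxTransformer(self.column, min(values), max(values))

class MinMaxTransformer:
    """Transformer: applies the learned parameters to any data."""
    def __init__(self, column, x_min, x_max):
        self.column, self.x_min, self.x_max = column, x_min, x_max

    def transform(self, rows):
        for row in rows:
            row[self.column] = (row[self.column] - self.x_min) / (self.x_max - self.x_min)
        return rows

class Pipeline:
    """Chains estimators; fitting each stage on the output of the previous."""
    def __init__(self, estimators):
        self.estimators = estimators

    def fit(self, rows):
        transformers = []
        for est in self.estimators:
            t = est.fit(rows)
            rows = t.transform(rows)  # later stages see transformed data
            transformers.append(t)
        return transformers
```

Fitting on training data and then transforming unseen data with the learned min/max mirrors how a fitted pipeline is reused at prediction time.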
Usage
Apply feature engineering after data loading and splitting but before training. Choose transforms based on column types: one-hot encoding for low-cardinality categoricals, text featurization for natural language, normalization for numerics with varied scales. Always concatenate into a single "Features" column as the final step.
Theoretical Basis
One-hot encoding represents a categorical variable C with k distinct values as a vector in R^k:
OneHot(c_i) = e_i where e_i is the i-th standard basis vector in R^k
Example: colors = {red, green, blue}
OneHot(red) = [1, 0, 0]
OneHot(green) = [0, 1, 0]
OneHot(blue) = [0, 0, 1]
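The colors example can be reproduced with a minimal sketch of one-hot encoding (illustrative Python; the helper names `fit_one_hot` and `one_hot` are invented here, not library functions). Fitting learns the vocabulary; encoding maps a value to its standard basis vector e_i in R^k.

```python
def fit_one_hot(values):
    # Learn the vocabulary in first-seen order; index i picks basis vector e_i.
    vocab = []
    for v in values:
        if v not in vocab:
            vocab.append(v)
    return vocab

def one_hot(vocab, value):
    # Binary indicator vector of length k with exactly one element set to 1.
    vec = [0] * len(vocab)
    vec[vocab.index(value)] = 1
    return vec

vocab = fit_one_hot(["red", "green", "blue"])
print(one_hot(vocab, "red"))    # [1, 0, 0]
print(one_hot(vocab, "green"))  # [0, 1, 0]
print(one_hot(vocab, "blue"))   # [0, 0, 1]
```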
TF-IDF text featurization combines term frequency and inverse document frequency:
TF(t, d) = count(t in d) / |d|
IDF(t, D) = log(|D| / (1 + count(d in D : t in d)))
TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)
This assigns higher weights to terms that are frequent within a document but rare across the corpus.
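The three formulas above translate directly into code. This is a sketch of the definitions as written (documents pre-tokenized into word lists; the smoothing term `1 +` in the IDF denominator matches the formula above, so corpus-wide terms can receive slightly negative weights).

```python
import math

def tf(term, doc):
    # Term frequency: count normalized by document length |d|.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency with the 1+ smoothing from the formula above.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
# "dog" occurs in one document -> positive weight;
# "the" occurs in every document -> weight at or below zero.
```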
Min-max normalization rescales a feature x to the [0, 1] interval:
x_normalized = (x - x_min) / (x_max - x_min)
When fixZero is enabled, the formula is adjusted to ensure that zero maps to zero, preserving sparsity:
x_normalized = x / max(|x_min|, |x_max|) (if fixZero=true)
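Both normalization variants can be sketched side by side (illustrative Python; `fix_zero` here mirrors the fixZero option described above but is not a library flag). Note how the plain min-max formula moves zero off of zero, while the fixZero variant preserves it.

```python
def fit_min_max(xs):
    # Learn the data-dependent parameters at fit time.
    return min(xs), max(xs)

def normalize(x, x_min, x_max, fix_zero=False):
    if fix_zero:
        # Zero maps to zero, so sparse vectors stay sparse.
        return x / max(abs(x_min), abs(x_max))
    return (x - x_min) / (x_max - x_min)

x_min, x_max = fit_min_max([-2.0, 0.0, 4.0])
print(normalize(4.0, x_min, x_max))                 # 1.0
print(normalize(0.0, x_min, x_max))                 # ~0.333: zero is NOT preserved
print(normalize(0.0, x_min, x_max, fix_zero=True))  # 0.0: sparsity preserved
```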
Concatenation produces the final feature vector by appending all transformed columns:
Features = [OneHot(C1) ; Normalize(N1) ; TF-IDF(T1)]
where ; denotes vector concatenation.
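The concatenation step is a straightforward append of the transformed columns. A minimal sketch (illustrative Python; the parts below are hypothetical outputs of the transforms above):

```python
def concatenate(*columns):
    # Append each transformed column end to end into one feature vector.
    features = []
    for col in columns:
        features.extend(col)
    return features

one_hot_part = [1, 0, 0]        # OneHot(C1)
numeric_part = [0.5]            # Normalize(N1)
text_part = [0.12, 0.0, 0.4]    # TF-IDF(T1)

features = concatenate(one_hot_part, numeric_part, text_part)
# -> [1, 0, 0, 0.5, 0.12, 0.0, 0.4], a single vector of length 7
```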