Principle: .NET Machine Learning Binary Classification Training
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Classification, Supervised Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Binary classification is a supervised learning task that assigns each input instance to one of exactly two classes based on learned decision boundaries derived from labeled training data.
Description
Binary classification is one of the most common machine learning tasks. Given a set of labeled examples where each label is either positive or negative (1 or 0, true or false), the goal is to learn a function f(x) that maps input feature vectors to class predictions. The learned function generalizes to unseen examples by capturing patterns in the training data.
Two prominent algorithmic families for binary classification are:
- Gradient boosted decision trees (e.g., FastTree, LightGBM): Build an ensemble of shallow decision trees sequentially, where each tree corrects the errors of the previous ensemble. These methods are highly effective on tabular data with mixed feature types.
- Stochastic dual coordinate ascent (SDCA) for logistic regression: An optimization algorithm that solves the logistic regression objective by iterating over dual variables. SDCA is efficient for large, sparse datasets and converges to a linear decision boundary.
The estimator-transformer pattern cleanly separates the configuration of a training algorithm (hyperparameters, column names) from its execution (fitting on data). An estimator encapsulates the algorithm configuration and implements a Fit method that, given training data, produces a transformer. The transformer is the trained model that can score new data.
This separation enables:
- Pipeline composition: chain transforms and trainers into a single estimator pipeline.
- Reproducibility: the same estimator with the same data produces the same model.
- Serialization: transformers can be saved and loaded independently of their estimator.
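The estimator-transformer split can be sketched in a few lines. The following is an illustrative analogy in Python, not the library's actual API: the class names and the trivial mean-threshold "trainer" are invented for this example, but the shape of the pattern (configuration object whose fit produces an immutable scoring object) is the one described above.

```python
class MeanThresholdEstimator:
    """Estimator: holds only configuration and knows how to fit."""
    def __init__(self, feature_column):
        self.feature_column = feature_column  # hyperparameter/column config

    def fit(self, rows):
        # "Training": compute the mean of the feature as a decision threshold.
        values = [row[self.feature_column] for row in rows]
        threshold = sum(values) / len(values)
        return MeanThresholdTransformer(self.feature_column, threshold)


class MeanThresholdTransformer:
    """Transformer: the trained model; scores unseen rows."""
    def __init__(self, feature_column, threshold):
        self.feature_column = feature_column
        self.threshold = threshold

    def transform(self, rows):
        return [1 if row[self.feature_column] > self.threshold else 0
                for row in rows]


train = [{"x": 1.0}, {"x": 3.0}]
model = MeanThresholdEstimator("x").fit(train)    # estimator -> transformer
print(model.transform([{"x": 0.5}, {"x": 5.0}]))  # -> [0, 1]
```

Because the transformer carries only the fitted state (here, the threshold), it can be serialized and reloaded without any reference to the estimator that produced it.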
Usage
Use binary classification when the target variable has exactly two classes. Choose gradient boosted trees (FastTree or LightGBM) as a strong default for tabular data with moderate to large feature counts. Choose SDCA logistic regression when you need a linear model for interpretability, or when dealing with very high-dimensional sparse features (e.g., bag-of-words text features).
Theoretical Basis
Logistic regression models the log-odds of the positive class as a linear function of features:
P(y=1|x) = sigma(w^T x + b) = 1 / (1 + exp(-(w^T x + b)))
Loss = -sum_i [ y_i * log(P_i) + (1 - y_i) * log(1 - P_i) ] (cross-entropy)
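The two formulas above translate directly into code. A minimal pure-Python sketch (function names are ours, no library assumed):

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))"""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """P(y=1|x) = sigma(w^T x + b)"""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def cross_entropy(w, b, X, y):
    """Loss = -sum_i [ y_i log P_i + (1 - y_i) log(1 - P_i) ]"""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = predict_proba(w, b, x_i)
        total -= y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return total
```

With zero weights every prediction is sigma(0) = 0.5, so each example contributes log 2 ≈ 0.693 to the loss; training drives the weights toward values that shrink this sum.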
SDCA minimizes a regularized loss by iterating over training examples and updating dual variables:
minimize (1/n) * sum_i loss(w^T x_i, y_i) + (lambda/2) * ||w||^2
For each example i:
    delta_alpha_i = argmax improvement in dual objective
    alpha_i += delta_alpha_i
    w += (delta_alpha_i / (lambda * n)) * x_i
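The loop above can be made concrete. For logistic loss the dual-maximizing step has no closed form and is solved approximately in practice, so this sketch substitutes squared loss, where delta_alpha_i does have a closed form; the alpha/w bookkeeping is exactly the scheme shown above. This is our own illustrative Python, not the library implementation:

```python
import random

def sdca_ridge(X, y, lam, epochs=50, seed=0):
    """SDCA sketch with squared loss (closed-form dual step).
    Invariant maintained throughout: w = (1/(lam*n)) * sum_i alpha_i * x_i."""
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d
    rng = random.Random(seed)
    for _ in range(epochs):
        for i in rng.sample(range(n), n):  # one pass in random order
            x_i = X[i]
            margin = sum(wj * xj for wj, xj in zip(w, x_i))  # w^T x_i
            sq_norm = sum(xj * xj for xj in x_i)
            # delta_alpha_i = argmax improvement in dual objective
            # (closed form for squared loss)
            delta = (y[i] - margin - alpha[i]) / (1.0 + sq_norm / (lam * n))
            alpha[i] += delta                                # alpha_i update
            scale = delta / (lam * n)
            w = [wj + scale * xj for wj, xj in zip(w, x_i)]  # w update
    return w
```

On data generated as y = 2x with small lambda, this converges to w ≈ 2 within a few passes. Note that each update touches only one example and the running weight vector, which is what makes the method cheap on large, sparse datasets.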
Gradient boosted trees fit an additive model of trees:
F_0(x) = initial prediction (e.g., log-odds of positive class)
For m = 1 to M:
    r_i = -dLoss/dF_{m-1}(x_i)          // pseudo-residuals (negative gradient)
    h_m = fit regression tree to {(x_i, r_i)}
    F_m(x) = F_{m-1}(x) + eta * h_m(x)  // eta = learning rate
Prediction: P(y=1|x) = sigma(F_M(x))
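The boosting recipe above can be demonstrated end to end on a toy problem. This sketch is deliberately simplified: one-dimensional features, depth-1 regression stumps, and mean-residual leaf values (production gradient boosting libraries use deeper trees and more refined Newton-style leaf estimates), but the F_0 / pseudo-residual / eta * h_m structure matches the pseudocode:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(x, r):
    """Best single-split regression stump on 1-D inputs (minimizes SSE)."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lmean) ** 2 for ri in left)
               + sum((ri - rmean) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def boost(x, y, n_trees=20, eta=0.3):
    p = sum(y) / len(y)
    f0 = math.log(p / (1 - p))          # F_0: log-odds of positive class
    F = [f0] * len(x)
    trees = []
    for _ in range(n_trees):
        # pseudo-residuals for log loss: r_i = y_i - sigma(F_{m-1}(x_i))
        r = [yi - sigmoid(Fi) for yi, Fi in zip(y, F)]
        h = fit_stump(x, r)             # h_m: fit tree to residuals
        trees.append(h)
        F = [Fi + eta * h(xi) for Fi, xi in zip(F, x)]  # F_m = F_{m-1} + eta*h_m
    def predict_proba(xi):
        # P(y=1|x) = sigma(F_M(x))
        return sigmoid(f0 + eta * sum(h(xi) for h in trees))
    return predict_proba

model = boost([0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 1])
```

After 20 rounds the ensemble separates the two classes: points below the split get probabilities well under 0.5 and points above get probabilities well over 0.5, with the gap widening as trees are added, which is the eta-vs-M trade-off discussed next.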
Key hyperparameters include numberOfLeaves (tree complexity), learningRate (step size), and numberOfTrees (ensemble size). Smaller learning rates with more trees typically yield better generalization.