
Principle:Scikit-learn Model Training

From Leeroopedia


sources: Bishop, C.M. (2006). Pattern Recognition and Machine Learning, Springer; scikit-learn documentation: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
domains: Machine_Learning, Optimization, Statistics
last_updated: 2026-02-08 15:00 GMT

Overview

An optimization process that adjusts model parameters to minimize a loss function on training data.

Description

Model training (also called fitting) is the core computational step in supervised learning. Given a training set of n labeled examples {(𝐱ᵢ, yᵢ)}ᵢ₌₁ⁿ, the training process searches for parameter values that minimize a chosen loss function over the training data, subject to optional regularization constraints.

In scikit-learn, training is triggered by calling the fit(X, y) method on an instantiated estimator. This method:

  1. Validates and preprocesses the input data (type checking, dtype conversion, sparse format handling).
  2. Executes the optimization algorithm specified by the estimator's hyperparameters.
  3. Stores the learned parameters as instance attributes with trailing underscores (e.g., coef_, intercept_).
  4. Returns self, enabling method chaining.
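These steps can be seen end to end with a toy dataset; a minimal sketch (the choice of LogisticRegression and the data below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary classification data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# fit() validates the input, runs the solver, stores the learned
# parameters, and returns the estimator itself.
clf = LogisticRegression().fit(X, y)

# Learned parameters are exposed as trailing-underscore attributes.
print(clf.coef_.shape)       # weight matrix: (1, n_features) in the binary case
print(clf.intercept_.shape)  # bias term: (1,)
```

Because fit returns self, instantiation and training can be chained in a single expression, as above.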

The specific optimization strategy depends on the estimator and its configuration. Common approaches include:

  • Gradient-based optimization -- Iterative methods such as L-BFGS, Newton-CG, and stochastic average gradient (SAG/SAGA) that use gradient information to find the loss minimum.
  • Coordinate descent -- Used by the liblinear solver for L1-regularized problems.
  • Closed-form solutions -- Some estimators (e.g., ordinary least squares) compute parameters directly via matrix algebra.
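The strategy is typically selected through the `solver` hyperparameter; a brief sketch contrasting the approaches above (the toy data is illustrative, and the closed-form case uses LinearRegression, whose least-squares fit has no iterative loop):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Gradient-based iterative optimization (L-BFGS).
lbfgs = LogisticRegression(solver="lbfgs").fit(X, y)

# Coordinate descent via liblinear, which supports L1 penalties.
lib = LogisticRegression(solver="liblinear", penalty="l1").fit(X, y)

# Closed-form: ordinary least squares is solved directly
# via matrix algebra rather than an iterative optimizer.
ols = LinearRegression().fit(X, y.astype(float))
print(ols.coef_)  # slope of the least-squares line
```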

Regularization is a technique applied during training to prevent overfitting by penalizing large parameter values. Common regularization forms include L2 (ridge), L1 (lasso), and Elastic-Net (a combination of L1 and L2).
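The three penalty forms map directly onto LogisticRegression's `penalty` hyperparameter; a sketch with illustrative toy data (note that L1 and Elastic-Net require a compatible solver such as liblinear or saga):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, 0.1], [3.0, 0.0]])
y = np.array([0, 0, 1, 1])

# L2 (ridge): shrinks all weights toward zero.
l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# L1 (lasso): can drive some weights exactly to zero (sparsity).
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

# Elastic-Net: L1/L2 mix controlled by l1_ratio; requires saga.
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000).fit(X, y)
```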

Usage

Use model training when:

  • Fitting a model to labeled data -- The standard supervised learning workflow requires calling fit(X_train, y_train) on the training subset.
  • Retraining after hyperparameter changes -- After modifying hyperparameters via set_params, the model must be re-fitted.
  • Warm starting -- Some estimators support warm_start=True, allowing training to resume from previously learned parameters rather than starting from scratch.
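The retraining and warm-start patterns above can be sketched together; SGDClassifier is used here only as an example of an estimator that supports warm_start (the data is illustrative):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# warm_start=True: each call to fit() resumes from the previously
# learned coefficients instead of re-initializing them.
clf = SGDClassifier(warm_start=True, random_state=0)
clf.fit(X, y)
first_coef = clf.coef_.copy()
clf.fit(X, y)  # continues optimizing from first_coef

# Changing hyperparameters via set_params does not retrain;
# the model must be re-fitted for the change to take effect.
clf.set_params(alpha=0.01)
clf.fit(X, y)
```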

Theoretical Basis

Maximum Likelihood Estimation

For classification with logistic regression, training corresponds to maximum likelihood estimation (MLE). The model assumes that the probability of class k given features 𝐱 follows the softmax (multinomial) or sigmoid (binary) function of a linear combination of features.

In the binary case, the model estimates:

P(y = 1 | 𝐱) = σ(𝐰ᵀ𝐱 + b) = 1 / (1 + e^(−(𝐰ᵀ𝐱 + b)))

where σ is the logistic sigmoid function, 𝐰 is the weight vector, and b is the bias (intercept).
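This relationship can be verified numerically: computing the sigmoid by hand from the learned weights reproduces the estimator's predict_proba output. A sketch with illustrative toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# P(y = 1 | x) computed directly from the learned weights...
manual = sigmoid(X @ w + b)

# ...matches the positive-class column of predict_proba.
assert np.allclose(manual, clf.predict_proba(X)[:, 1])
```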

Loss Minimization

MLE is equivalent to minimizing the negative log-likelihood, which for logistic regression yields the logistic loss (also called cross-entropy loss or log loss):

ℒ(𝐰, b) = −(1/n) Σᵢ₌₁ⁿ [ yᵢ log(p̂ᵢ) + (1 − yᵢ) log(1 − p̂ᵢ) ]

where p̂ᵢ = σ(𝐰ᵀ𝐱ᵢ + b) is the predicted probability for sample i.
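Evaluating this formula by hand on a fitted model reproduces scikit-learn's log_loss metric; a sketch with illustrative toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)

# Predicted probability of the positive class for each sample.
p_hat = clf.predict_proba(X)[:, 1]

# Negative log-likelihood averaged over samples, per the formula above.
manual = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Matches scikit-learn's built-in metric.
assert np.isclose(manual, log_loss(y, p_hat))
```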

Regularized Objective

With regularization, the objective becomes:

min over 𝐰, b of  (1 / (2C)) [ (1 − α) ‖𝐰‖₂² + α ‖𝐰‖₁ ] + ℒ(𝐰, b)

where C is the inverse regularization strength and α is the L1 ratio (Elastic-Net mixing parameter). Setting α=0 yields pure L2 regularization; α=1 yields pure L1 regularization.
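Because C is the inverse regularization strength, smaller C means a heavier penalty and therefore smaller learned weights; a sketch of this effect on illustrative toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Small C: strong regularization, weights shrunk toward zero.
strong = LogisticRegression(C=0.01).fit(X, y)

# Large C: weak regularization, weights closer to the unpenalized MLE.
weak = LogisticRegression(C=100.0).fit(X, y)

assert abs(strong.coef_[0, 0]) < abs(weak.coef_[0, 0])
```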
