Principle: Scikit-learn Model Instantiation
| Field | Value |
|---|---|
| sources | Buitinck, L. et al. (2013). API design for machine learning software: experiences from the scikit-learn project. ECML PKDD Workshop: Languages for Data Mining and Machine Learning; scikit-learn documentation: https://scikit-learn.org/stable/developers/develop.html |
| domains | Machine_Learning, Software_Engineering |
| last_updated | 2026-02-08 15:00 GMT |
Overview
A design pattern that configures a machine learning estimator with hyperparameters before training.
Description
In scikit-learn, every machine learning algorithm is represented as a Python class that follows the estimator pattern. Model instantiation is the act of creating an instance of such a class, passing hyperparameters as constructor arguments. This step configures how the model will learn but does not yet perform any learning.
The estimator pattern is enforced through the BaseEstimator base class, which provides:
- `get_params(deep=True)` -- Returns a dictionary of the estimator's hyperparameters and their current values. This enables introspection, serialization, and use within meta-estimators such as `GridSearchCV`.
- `set_params(**params)` -- Sets hyperparameters by name, allowing programmatic reconfiguration. This is used internally by hyperparameter search utilities.
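A short sketch of this protocol, using `LogisticRegression` as an arbitrary example estimator:

```python
from sklearn.linear_model import LogisticRegression

# Instantiation configures the model; no learning happens yet.
clf = LogisticRegression(C=0.5, max_iter=200)

# get_params exposes the configuration as a plain dict.
params = clf.get_params()

# set_params reconfigures by keyword name, which is how generic tools
# such as GridSearchCV retarget a single estimator instance.
clf.set_params(C=2.0)
```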
The key design rule is that every constructor parameter must be stored as an instance attribute with the same name. For example, if the constructor accepts `C=1.0`, the instance must have `self.C = 1.0`. No validation or transformation of hyperparameters is performed in the constructor; validation is deferred to the `fit` method.
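The rule can be illustrated with a minimal custom estimator. `ThresholdClassifier` below is a hypothetical class written for this sketch, not part of scikit-learn; only `BaseEstimator` is real:

```python
from sklearn.base import BaseEstimator

class ThresholdClassifier(BaseEstimator):
    """Hypothetical estimator illustrating the constructor convention."""

    def __init__(self, threshold=0.5, random_state=None):
        # Every constructor argument is stored verbatim under the same name;
        # no validation or transformation happens here.
        self.threshold = threshold
        self.random_state = random_state

    def fit(self, X, y):
        # Hyperparameter validation is deferred to fit, per the design rule.
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be in [0, 1]")
        self.classes_ = sorted(set(y))  # fitted state: trailing underscore
        return self
```

Because the attributes mirror the constructor signature, `BaseEstimator` supplies working `get_params` and `set_params` implementations with no extra code.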
Usage
Use model instantiation when:
- Configuring a classifier or regressor -- Set hyperparameters such as regularization strength, solver algorithm, convergence tolerance, and maximum iterations.
- Building pipelines -- Instantiated estimators are composed into `Pipeline` objects, where each step is an estimator instance.
- Hyperparameter search -- Tools like `GridSearchCV` use `set_params` to reconfigure estimators across different hyperparameter combinations.
- Reproducibility -- Explicitly setting `random_state` during instantiation ensures deterministic behavior across runs.
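The pipeline and search cases above compose naturally. A sketch, with the step names and parameter grid chosen arbitrarily for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Each pipeline step is a (name, estimator instance) pair.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=500, random_state=0)),
])

# Nested hyperparameters are addressed as <step>__<param>; GridSearchCV
# calls set_params with these keys for every combination it tries.
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=3)
```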
Theoretical Basis
Separation of Configuration from Execution
The estimator pattern embodies a strict separation between configuration (constructor) and execution (fit/predict/transform). This separation has several benefits:
- Declarative specification -- The constructor call serves as a complete, human-readable specification of the model's configuration. No hidden state is introduced.
- Cloning -- The `sklearn.base.clone` function creates a new estimator with the same hyperparameters but without fitted state, which is essential for cross-validation, where a fresh model must be trained on each fold.
- Introspection -- The `get_params`/`set_params` protocol enables generic tools to manipulate any estimator without knowing its specific class.
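The cloning behavior can be demonstrated directly; the dataset and estimator here are arbitrary choices for the sketch:

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
fitted = LogisticRegression(C=0.5, max_iter=300).fit(X, y)

# clone copies the hyperparameters into a brand-new, unfitted estimator:
# same configuration, but no coef_ or other fitted attributes.
fresh = clone(fitted)
```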
Reproducibility
Many algorithms involve randomness (random initialization, stochastic optimization, data sampling). By accepting a random_state parameter at instantiation time, scikit-learn allows users to fix the random seed, ensuring that repeated runs produce identical results. This is critical for scientific reproducibility and debugging.
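For example, with a randomized ensemble (the estimator and synthetic dataset below are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=8, random_state=0)

# Fixing random_state at instantiation makes the stochastic training
# procedure deterministic: two identically configured runs agree exactly.
run_a = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
run_b = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
```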
Hyperparameters vs. Learned Parameters
It is important to distinguish between:
- Hyperparameters -- Set by the user before training (e.g., `C`, `max_iter`, `solver`). These are arguments to the constructor.
- Learned (fitted) parameters -- Estimated from data during `fit` (e.g., `coef_`, `intercept_`). By convention, these are stored as attributes with a trailing underscore.
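The distinction in a sketch, again using `LogisticRegression` on an arbitrary dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(C=1.0, max_iter=300)  # hyperparameters: set in the constructor
clf.fit(X, y)                                  # learned parameters: estimated by fit

# Hyperparameters keep their plain names (clf.C); fitted attributes
# gain a trailing underscore (clf.coef_, clf.intercept_, clf.classes_).
```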