
Implementation:Scikit-learn BaseSearchCV Fit

From Leeroopedia



Overview

Concrete scikit-learn method for executing a hyperparameter search by fitting and scoring the estimator across all candidate parameter settings and CV folds.

The BaseSearchCV.fit method is the central execution engine for all search-based hyperparameter tuners in scikit-learn. It orchestrates the parallel clone-fit-score loop, aggregates results, selects the best configuration, and optionally refits a final estimator on the full dataset.

Code Reference

Method Signature

def fit(self, X, y=None, **params):
    """Run fit with all sets of parameters.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features) or (n_samples, n_samples)
        Training vectors, where n_samples is the number of samples and
        n_features is the number of features. For precomputed kernel or
        distance matrix, the expected shape of X is (n_samples, n_samples).

    y : array-like of shape (n_samples, n_output)
        or (n_samples,), default=None
        Target relative to X for classification or regression;
        None for unsupervised learning.

    **params : dict of str -> object
        Parameters passed to the fit method of the estimator, the scorer,
        and the CV splitter.

    Returns
    -------
    self : object
        Instance of fitted estimator.
    """

I/O Contract

Input:

  • X -- training feature matrix, array-like of shape (n_samples, n_features).
  • y -- target values, array-like of shape (n_samples,) or (n_samples, n_output). May be None for unsupervised estimators.
  • **params -- additional parameters routed to the estimator's fit, the scorer's score, or the CV splitter's split (e.g., sample_weight, groups).
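The routing of `**params` can be seen with a grouped cross-validator: a sketch, assuming synthetic data, where `groups` is consumed by the CV splitter (GroupKFold) rather than by the estimator's fit.

```python
# Sketch: `groups` passed to search.fit is routed to GroupKFold.split,
# not to LogisticRegression.fit. Data is synthetic for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 2, size=60)
groups = np.repeat(np.arange(6), 10)  # 6 groups of 10 samples each

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0]},
    cv=GroupKFold(n_splits=3),
)
search.fit(X, y, groups=groups)  # groups flow to the splitter
print(search.best_params_)
```

Each of the three folds holds out two whole groups, so no group's samples appear in both train and test.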

Output:

  • Returns self (the fitted search object), with the following attributes populated:
  • cv_results_ (dict of numpy arrays) -- Comprehensive results dictionary containing per-split scores, mean/std aggregations, rankings, fit/score times, and parameter values for every candidate.
  • best_params_ (dict) -- The parameter configuration that achieved the highest mean test score (rank 1).
  • best_score_ (float) -- The mean cross-validated score of the best candidate. Not available when refit is a callable.
  • best_index_ (int) -- The index into the cv_results_ arrays corresponding to the best candidate.
  • best_estimator_ (estimator) -- A clone of the base estimator, fitted on the full dataset with best_params_. Only available when refit is not False.
  • n_splits_ (int) -- The number of cross-validation splits used.
  • refit_time_ (float) -- Seconds spent refitting on the full dataset. Only available when refit is not False.
  • multimetric_ (bool) -- Whether multiple scoring metrics were used.
  • scorer_ (function or dict) -- The scorer(s) used; a dict for multi-metric evaluation.
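A quick way to see these attributes in practice is to load cv_results_ into a DataFrame after a small search; a sketch using a decision tree on iris:

```python
# Illustrative: cv_results_ is a dict of equal-length arrays,
# one entry per candidate, so it maps directly onto a DataFrame.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 4]},
    cv=3,
)
search.fit(X, y)

df = pd.DataFrame(search.cv_results_)  # one row per candidate
print(df[["params", "mean_test_score", "std_test_score", "rank_test_score"]])
print(search.best_index_)  # row index of the rank-1 candidate
```

Note that best_index_ always points at the row whose rank_test_score is 1.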

Execution Flow

The fit method proceeds through these stages:

1. Setup:

estimator = self.estimator
scorers, refit_metric = self._get_scorers()
X, y = indexable(X, y)
params = _check_method_params(X, params=params)
routed_params = self._get_routed_params_for_fit(params)
cv_orig = check_cv(self.cv, y, classifier=is_classifier(estimator))
n_splits = cv_orig.get_n_splits(X, y, **routed_params.splitter.split)
base_estimator = clone(self.estimator)
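The check_cv call in the setup stage is public API and can be exercised directly; it turns an integer cv into a concrete splitter, stratified for classifiers and plain KFold otherwise:

```python
# check_cv resolves cv=3 to a splitter object; stratification depends
# on whether the estimator is a classifier.
import numpy as np
from sklearn.model_selection import check_cv

y_class = np.array([0, 0, 1, 1, 0, 1])
cv_clf = check_cv(3, y_class, classifier=True)
cv_reg = check_cv(3, y_class, classifier=False)
print(type(cv_clf).__name__)  # StratifiedKFold
print(type(cv_reg).__name__)  # KFold
print(cv_clf.get_n_splits())  # 3
```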

2. Parallel evaluation via evaluate_candidates callback:

parallel = Parallel(n_jobs=self.n_jobs, pre_dispatch=self.pre_dispatch)

def evaluate_candidates(candidate_params, cv=None, more_results=None):
    cv = cv or cv_orig
    candidate_params = list(candidate_params)
    n_candidates = len(candidate_params)

    out = parallel(
        delayed(_fit_and_score)(
            clone(base_estimator), X, y,
            train=train, test=test, parameters=parameters,
            split_progress=(split_idx, n_splits),
            candidate_progress=(cand_idx, n_candidates),
            **fit_and_score_kwargs,
        )
        for (cand_idx, parameters), (split_idx, (train, test)) in product(
            enumerate(candidate_params),
            enumerate(cv.split(X, y, **routed_params.splitter.split)),
        )
    )
    ...
    results = self._format_results(
        all_candidate_params, n_splits, all_out, all_more_results
    )
    return results
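The itertools.product over candidates and splits means every (candidate, fold) pair becomes one independent fit-and-score task for joblib. A minimal sketch of that enumeration, with placeholder folds standing in for real index arrays:

```python
# Each (candidate, split) pair is one parallel task: with 4 candidates
# and 5 folds, 4 * 5 = 20 independent fit-and-score jobs are dispatched.
from itertools import product

candidate_params = [{"C": c, "kernel": k}
                    for c in (0.1, 1) for k in ("linear", "rbf")]
splits = [(f"train_{i}", f"test_{i}") for i in range(5)]  # placeholder folds

tasks = [
    (cand_idx, split_idx)
    for (cand_idx, params), (split_idx, fold) in product(
        enumerate(candidate_params), enumerate(splits)
    )
]
print(len(tasks))  # 20
```

This is why n_jobs parallelism scales with n_candidates * n_splits, not with either factor alone.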

3. Best selection and refit:

# Select best
self.best_index_ = self._select_best_index(self.refit, refit_metric, results)
self.best_score_ = results[f"mean_test_{refit_metric}"][self.best_index_]
self.best_params_ = results["params"][self.best_index_]

# Refit on full data
self.best_estimator_ = clone(base_estimator).set_params(
    **clone(self.best_params_, safe=False)
)
self.best_estimator_.fit(X, y, **routed_params.estimator.fit)
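When refit=False, this entire stage is skipped: a sketch showing that best_estimator_ and refit_time_ are then never set, while the selection attributes remain available.

```python
# With refit=False the search still selects a best candidate but never
# fits a final estimator on the full dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1]}, cv=3, refit=False)
search.fit(X, y)

print(hasattr(search, "best_estimator_"))  # False
print(hasattr(search, "refit_time_"))      # False
print(search.best_params_)                 # still available
```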

4. Store final results:

self.cv_results_ = results
self.n_splits_ = n_splits
return self

Usage Examples

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    SVC(),
    param_grid={'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    cv=5,
    n_jobs=-1,
    scoring='accuracy',
    return_train_score=True,
)
search.fit(X, y)

# After fit, all result attributes are available
print(search.best_params_)       # e.g. {'C': 1, 'kernel': 'linear'}
print(search.best_score_)        # e.g. 0.98
print(search.best_estimator_)    # SVC(C=1, kernel='linear')
print(search.n_splits_)          # 5
print(search.refit_time_)        # e.g. 0.002 (seconds; machine-dependent)
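With multiple scoring metrics, refit must name the metric used for best-candidate selection; a sketch showing how multimetric_ and scorer_ change in that case:

```python
# Multi-metric search: scoring is a dict of named scorers and refit
# picks which metric drives best_index_/best_score_.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    SVC(),
    param_grid={'C': [0.1, 1]},
    cv=3,
    scoring={'acc': 'accuracy', 'f1': 'f1_macro'},
    refit='acc',  # best_score_ reports mean_test_acc
)
search.fit(X, y)

print(search.multimetric_)            # True
print(sorted(search.scorer_.keys()))  # ['acc', 'f1']
print(search.best_score_)
```

cv_results_ then contains per-metric columns such as mean_test_acc and mean_test_f1 instead of a single mean_test_score.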
