Principle:Scikit learn Scikit learn Search Results Analysis

Overview

An interpretation process that extracts the best model configuration and performance statistics from a completed hyperparameter search.

Description

What Are Search Results?

After a hyperparameter search completes, it produces a rich set of results that capture the performance of every candidate configuration across all cross-validation folds. These results serve two purposes: (1) identifying the best configuration to deploy, and (2) providing diagnostic information about how different hyperparameters affect model performance.

The results are stored in a structured dictionary (cv_results_) that can be converted to a pandas DataFrame for tabular analysis, visualization, and further exploration.
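As a minimal sketch of that conversion, the snippet below runs a small grid search and loads cv_results_ into a DataFrame. The iris dataset, the SVC estimator, and the grid values are illustrative assumptions, not part of the original discussion.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative grid: 2 values of C x 2 kernels = 4 candidates.
param_grid = {"C": [0.1, 1.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

# cv_results_ is a dict with one entry per candidate under each key,
# so it maps directly onto a table with one row per candidate.
results = pd.DataFrame(search.cv_results_)

key_columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[key_columns])
```

Each row of the resulting table corresponds to one candidate configuration, and each key of cv_results_ becomes a column.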

Key Result Components

The results of a hyperparameter search consist of several categories of information:

Best configuration attributes -- The parameter combination that achieved the highest mean cross-validated score, along with its score and index.
- best_params_ -- the dictionary of optimal hyperparameter values.
- best_score_ -- the mean cross-validated score achieved by the best configuration.
- best_index_ -- the index into cv_results_ arrays for the best configuration.
- best_estimator_ -- the estimator refitted on the full dataset with the best parameters (available only when refit is enabled, which is the default).
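The relationship between these attributes and cv_results_ can be checked directly. The following is a sketch under assumed inputs (iris data, a k-nearest-neighbors classifier, and an arbitrary grid); it shows that best_index_ points at the row of cv_results_ that holds best_params_ and best_score_.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Illustrative search; refit=True is the default, so best_estimator_ exists.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 5, 15]}, cv=3)
search.fit(X, y)

i = search.best_index_
cv = search.cv_results_

# best_params_ and best_score_ are the entries of cv_results_ at best_index_.
same_params = cv["params"][i] == search.best_params_
same_score = bool(np.isclose(cv["mean_test_score"][i], search.best_score_))
is_rank_one = bool(cv["rank_test_score"][i] == 1)

# The refitted estimator carries the winning hyperparameter values.
refit_value = search.best_estimator_.get_params()["n_neighbors"]

print(same_params, same_score, is_rank_one, refit_value)
```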

Per-candidate statistics -- For every candidate configuration:
- Per-split test scores (split0_test_score, split1_test_score, etc.).
- Mean and standard deviation of test scores (mean_test_score, std_test_score).
- Rank among all candidates (rank_test_score).
- Optionally, training scores in the same format.

Parameter values -- The specific hyperparameter values used for each candidate, stored both as individual masked arrays (param_C, param_kernel, etc.) and as a list of dictionaries (params).

Timing information -- Fit time and score time for each candidate, reported as mean and standard deviation across folds (mean_fit_time, std_fit_time, mean_score_time, std_score_time).

How to Compare Candidates

Effective analysis of search results involves:

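Ranking by mean score -- The primary criterion. Candidates are ranked by mean_test_score in descending order. scikit-learn scorers follow a higher-is-better convention, and loss metrics are exposed in negated form (for example neg_mean_squared_error), so the same ordering applies to them.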
Variance analysis -- A candidate with a high mean score but also high standard deviation across folds may be unstable. Comparing std_test_score helps identify configurations that generalize reliably.
Overfitting diagnosis -- When return_train_score=True, comparing mean_train_score to mean_test_score reveals overfitting (large gap) or underfitting (both scores low).
Timing considerations -- In production, a slightly worse-scoring configuration that trains much faster may be preferable. The mean_fit_time column supports this tradeoff analysis.
Parameter sensitivity -- By examining how scores vary with individual parameters, practitioners can identify which hyperparameters have the largest impact on performance.
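The comparison criteria above can be gathered into a single table. This is a sketch only: the decision tree, the max_depth grid, and the derived column name train_test_gap are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3, None]},
    cv=5,
    return_train_score=True,  # required for the mean_train_score column
)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)

# Hypothetical helper column: a large positive gap hints at overfitting.
results["train_test_gap"] = results["mean_train_score"] - results["mean_test_score"]

columns = [
    "param_max_depth",
    "mean_test_score",   # ranking criterion
    "std_test_score",    # stability across folds
    "train_test_gap",    # overfitting diagnosis
    "mean_fit_time",     # training cost
    "rank_test_score",
]
summary = results[columns].sort_values("rank_test_score")
print(summary)
```

Reading the rows side by side makes it easier to spot a candidate that ranks slightly lower but is more stable or cheaper to fit.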

Usage

Search results analysis is performed after calling fit on a search estimator. The typical workflow is:

Access best_params_ and best_score_ for the top-line result.
Convert cv_results_ to a pandas DataFrame for detailed exploration.
Sort, filter, and visualize the DataFrame to understand parameter effects and tradeoffs.
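For the last step, one possible way to inspect parameter effects is to pivot the results into a score table with one parameter per axis. The estimator, grid values, and variable names below are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)

# Parameter columns may be stored with object dtype depending on the
# scikit-learn version, so cast them before using them as axes.
results["param_C"] = results["param_C"].astype(float)
results["param_gamma"] = results["param_gamma"].astype(float)

# Rows: C, columns: gamma, cells: mean cross-validated score.
score_table = results.pivot(
    index="param_C", columns="param_gamma", values="mean_test_score"
)
print(score_table)

# Range of mean scores seen for each value of C across all gamma values.
spread_by_C = results.groupby("param_C")["mean_test_score"].agg(["min", "max"])
print(spread_by_C)
```

A table like score_table can be passed to a heatmap plotting routine; a wide min-to-max range in spread_by_C suggests that the other parameter matters a lot at that value of C.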

Theoretical Basis

Ranking Candidates by Mean CV Score

The mean cross-validated score is the standard criterion for selecting the best candidate. For K folds, the mean score for candidate c is:

mean_score(c) = (1/K) * sum(score(c, k) for k in 1..K)

scikit-learn ranks candidates with scipy.stats.rankdata, applied to the negated mean scores with method="min", so the highest mean score receives rank 1. Ties (candidates with identical mean scores) receive the same (minimum) rank. Candidates whose scores are NaN (due to fit failures) are assigned the worst rank.
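The mean and ranking described above can be reproduced from the per-split columns. This sketch mirrors the description rather than the library's internal code; the estimator, grid, and fold count are assumptions.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

n_folds = 4
search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=n_folds
)
search.fit(X, y)
cv = search.cv_results_

# Stack per-split scores into an array of shape (n_candidates, n_folds).
split_scores = np.column_stack(
    [cv[f"split{k}_test_score"] for k in range(n_folds)]
)

# mean_score(c) = (1/K) * sum over folds
manual_mean = split_scores.mean(axis=1)

# Negate the stored means so the highest mean gets rank 1;
# method="min" gives tied candidates the same (lowest) rank.
manual_rank = rankdata(-cv["mean_test_score"], method="min").astype(int)

print(manual_mean)
print(manual_rank)
```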

Variance Analysis Across Folds

The standard deviation of scores across folds provides a measure of stability:

std_score(c) = sqrt((1/K) * sum((score(c, k) - mean_score(c))^2 for k in 1..K))

A high standard deviation indicates that the configuration's performance depends heavily on the particular train/test split, suggesting potential issues with:

Small dataset size -- high variance is inherent with limited data.
Model sensitivity -- the configuration may lie in a region of hyperparameter space where performance changes sharply, so small differences between folds produce large score differences.
Overfitting -- the configuration may memorize training data in some folds but fail on others.
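The std_score formula above divides by K rather than K - 1, which corresponds to NumPy's ddof=0. The sketch below recomputes it from the per-split scores and prints the sample version (ddof=1) for contrast; the estimator and grid are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

n_folds = 5
search = GridSearchCV(SVC(), {"C": [0.01, 1.0, 100.0]}, cv=n_folds)
search.fit(X, y)
cv = search.cv_results_

# Shape (n_candidates, n_folds): one row of fold scores per candidate.
split_scores = np.column_stack(
    [cv[f"split{k}_test_score"] for k in range(n_folds)]
)

# Population standard deviation (divisor K), matching the formula above.
manual_std = split_scores.std(axis=1, ddof=0)

# Sample standard deviation (divisor K - 1), shown only for comparison.
sample_std = split_scores.std(axis=1, ddof=1)

for params, pop, samp in zip(cv["params"], manual_std, sample_std):
    print(params, f"population std={pop:.4f}", f"sample std={samp:.4f}")
```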

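A common practical heuristic is the one-standard-error rule: rather than choosing the configuration with the absolute best mean score, choose the simplest configuration whose mean score is within one standard error of the best. Note that std_test_score is a standard deviation across folds; an approximate standard error is std_test_score / sqrt(K). The rule can be implemented via a custom refit callable, which receives cv_results_ and returns the index of the chosen candidate.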
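A minimal sketch of such a refit callable follows. The function name one_standard_error_index is hypothetical, and "simplest" is assumed here to mean the smallest max_depth of a decision tree; other models need their own notion of simplicity. The standard error is approximated as std_test_score / sqrt(K).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def one_standard_error_index(cv_results):
    """Pick the shallowest tree whose mean score is within one SE of the best.

    Hypothetical helper: assumes every candidate sets an integer max_depth.
    """
    means = np.asarray(cv_results["mean_test_score"], dtype=float)
    stds = np.asarray(cv_results["std_test_score"], dtype=float)

    # Count the folds from the per-split keys (split0_test_score, ...).
    n_folds = sum(
        1 for key in cv_results
        if key.startswith("split") and key.endswith("_test_score")
    )

    best = int(np.nanargmax(means))
    threshold = means[best] - stds[best] / np.sqrt(n_folds)

    depths = np.array([p["max_depth"] for p in cv_results["params"]])
    eligible = np.flatnonzero(means >= threshold)  # NaN means are excluded
    return int(eligible[np.argmin(depths[eligible])])


X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3, 4, 5]},
    cv=5,
    refit=one_standard_error_index,  # callable returns best_index_
)
search.fit(X, y)

print(search.best_index_, search.best_params_)
```

When refit is a callable, best_index_, best_params_, and best_estimator_ reflect the callable's choice, while best_score_ is not set; the score of the chosen candidate can still be read from cv_results_["mean_test_score"][best_index_].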

Masked Arrays for Heterogeneous Grids

When using multiple sub-grids (e.g., [{'kernel': ['linear']}, {'kernel': ['rbf'], 'gamma': [0.1, 1]}]), not all parameters apply to all candidates. scikit-learn uses numpy masked arrays to represent parameter values, where entries are masked for candidates to which a parameter does not apply. This allows the full results table to have a consistent structure even with heterogeneous parameter spaces.
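The masking can be observed with the sub-grid example from the paragraph above. The dataset and fold count are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two sub-grids: gamma is only defined for the rbf candidates.
param_grid = [
    {"kernel": ["linear"]},
    {"kernel": ["rbf"], "gamma": [0.1, 1]},
]
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

gamma = search.cv_results_["param_gamma"]
gamma_mask = np.ma.getmaskarray(gamma)  # True where gamma does not apply

for params, masked in zip(search.cv_results_["params"], gamma_mask):
    print(params, "gamma masked:", bool(masked))
```

The params list keeps only the keys that apply to each candidate, while param_gamma has one slot per candidate and masks the slot for the linear kernel.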

Related Pages

Implementation:Scikit_learn_Scikit_learn_CV_Results_Attributes
