Principle:Scikit learn Scikit learn Search Results Analysis
Overview
An interpretation process that extracts the best model configuration and performance statistics from a completed hyperparameter search.
Description
What Are Search Results?
After a hyperparameter search completes, it produces a rich set of results that capture the performance of every candidate configuration across all cross-validation folds. These results serve two purposes: (1) identifying the best configuration to deploy, and (2) providing diagnostic information about how different hyperparameters affect model performance.
The results are stored in a structured dictionary (cv_results_) that can be converted to a pandas DataFrame for tabular analysis, visualization, and further exploration.
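As a minimal sketch of that conversion, the snippet below fits a small grid search and loads cv_results_ into a DataFrame. The dataset (iris), the SVC estimator, and the parameter grid are illustrative assumptions, not part of the original text.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical grid, chosen only for illustration: 3 x 2 = 6 candidates.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

# cv_results_ is a dict of equal-length arrays, so each candidate becomes one row.
results = pd.DataFrame(search.cv_results_)
print(results.shape[0])  # number of candidates
print(sorted(results.columns))
```

Each row of the resulting table corresponds to one candidate, and each column to one of the result keys described in the next section.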
Key Result Components
The results of a hyperparameter search consist of several categories of information:
- Best configuration attributes -- The parameter combination that achieved the highest mean cross-validated score, along with its score and index.
  - best_params_ -- the dictionary of optimal hyperparameter values.
  - best_score_ -- the mean cross-validated score achieved by the best configuration.
  - best_index_ -- the index into cv_results_ arrays for the best configuration.
  - best_estimator_ -- the estimator refitted on the full dataset with the best parameters.
- Per-candidate statistics -- For every candidate configuration:
  - Per-split test scores (split0_test_score, split1_test_score, etc.).
  - Mean and standard deviation of test scores (mean_test_score, std_test_score).
  - Rank among all candidates (rank_test_score).
  - Optionally, training scores in the same format.
- Parameter values -- The specific hyperparameter values used for each candidate, stored both as individual masked arrays (param_C, param_kernel, etc.) and as a list of dictionaries (params).
- Timing information -- Fit time and score time for each candidate, with mean and standard deviation across folds.
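The components listed above can be inspected directly on a fitted search object. The following is a rough sketch; the dataset, estimator, and grid are assumptions made for illustration, and refit is left at its default of True so that best_estimator_ exists.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical single-parameter grid with 3-fold cross-validation.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)
search.fit(X, y)

cv = search.cv_results_
i = search.best_index_

# Best configuration attributes.
print(search.best_params_)
print(search.best_score_)

# best_index_ locates the winning candidate inside every cv_results_ array.
split_scores = [cv[f"split{k}_test_score"][i] for k in range(3)]
print(split_scores, cv["mean_test_score"][i], cv["std_test_score"][i])

# Timing information for the same candidate (seconds).
print(cv["mean_fit_time"][i], cv["mean_score_time"][i])

# The refitted estimator can be used for prediction right away.
print(search.best_estimator_.predict(X[:3]))
```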
How to Compare Candidates
Effective analysis of search results involves:
- Ranking by mean score -- The primary criterion. Candidates are ranked by mean_test_score in descending order. scikit-learn scorers follow a higher-is-better convention, so loss metrics are exposed in negated form (for example, neg_mean_squared_error) and the same ordering applies.
- Variance analysis -- A candidate with a high mean score but also high standard deviation across folds may be unstable. Comparing std_test_score helps identify configurations that generalize reliably.
- Overfitting diagnosis -- When return_train_score=True, comparing mean_train_score to mean_test_score reveals overfitting (large gap) or underfitting (both scores low).
- Timing considerations -- In production, a slightly worse-scoring configuration that trains much faster may be preferable. The mean_fit_time column supports this tradeoff analysis.
- Parameter sensitivity -- By examining how scores vary with individual parameters, practitioners can identify which hyperparameters have the largest impact on performance.
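The comparisons above can be sketched with a few pandas operations. The grid values, the train_test_gap column name, and the use of a per-parameter average as a sensitivity measure are illustrative assumptions rather than a prescribed method.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical grid; return_train_score=True adds the *_train_score columns.
param_grid = {"C": [0.01, 1.0, 100.0], "gamma": [0.01, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, return_train_score=True)
search.fit(X, y)

df = pd.DataFrame(search.cv_results_)

# Ranking, variance, and timing side by side, best candidate first.
ranked = df.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score", "mean_fit_time"]
]
print(ranked)

# Overfitting diagnosis: a large positive gap suggests overfitting.
df["train_test_gap"] = df["mean_train_score"] - df["mean_test_score"]
print(df[["params", "train_test_gap"]])

# Parameter sensitivity: average test score for each value of C.
sensitivity = df.groupby("param_C")["mean_test_score"].mean()
print(sensitivity)
```

Averaging over the other parameters is only one rough way to gauge sensitivity; a pivot table or a plot per parameter pair often shows interactions more clearly.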
Usage
Search results analysis is performed after calling fit on a search estimator. The typical workflow is:
- Access best_params_ and best_score_ for the top-line result.
- Convert cv_results_ to a pandas DataFrame for detailed exploration.
- Sort, filter, and visualize the DataFrame to understand parameter effects and tradeoffs.
Theoretical Basis
Ranking Candidates by Mean CV Score
The mean cross-validated score is the standard criterion for selecting the best candidate. For K folds, the mean score for candidate c is:
mean_score(c) = (1/K) * sum(score(c, k) for k in 1..K)
scikit-learn uses scipy.stats.rankdata with method="min" to rank candidates. Ties (candidates with identical mean scores) receive the same (minimum) rank. Candidates whose scores are NaN (due to fit failures) are assigned the worst rank.
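The ranking behavior described above can be reproduced on a small array. This is a sketch of the idea rather than the library's source code; the scores are made up, and filling NaN with a value below the minimum is one assumed way to push failed candidates to the bottom.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical mean test scores: two candidates tie for best, one fit failed.
mean_scores = np.array([0.90, 0.95, 0.95, 0.80, np.nan])

# Replace NaN with a value below every real score so it ranks last.
filled = np.nan_to_num(mean_scores, nan=np.nanmin(mean_scores) - 1)

# Negate so the highest score receives rank 1; ties share the minimum rank.
ranks = rankdata(-filled, method="min").astype(int)
print(ranks)  # [3 1 1 4 5]
```

With method="min", the two tied candidates both get rank 1 and the next candidate gets rank 3, so rank 2 is skipped.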
Variance Analysis Across Folds
The standard deviation of scores across folds provides a measure of stability:
std_score(c) = sqrt((1/K) * sum((score(c, k) - mean_score(c))^2 for k in 1..K))
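Both formulas can be checked numerically. The fold scores below are hypothetical; the sketch shows that dividing by K (not K - 1) matches NumPy's default population standard deviation.

```python
import numpy as np

# Hypothetical scores of one candidate across K = 5 folds.
fold_scores = np.array([0.90, 0.92, 0.88, 0.94, 0.86])
K = len(fold_scores)

# mean_score(c) = (1/K) * sum of per-fold scores
mean_score = fold_scores.sum() / K

# std_score(c) = sqrt((1/K) * sum of squared deviations from the mean)
std_score = np.sqrt(((fold_scores - mean_score) ** 2).sum() / K)

print(mean_score, std_score)

# np.std uses ddof=0 by default, which is the same 1/K normalization.
print(np.isclose(std_score, np.std(fold_scores)))
```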
A high standard deviation indicates that the configuration's performance depends heavily on the particular train/test split, suggesting potential issues with:
- Small dataset size -- high variance is inherent with limited data.
- Model sensitivity -- the configuration may sit in a region of hyperparameter space where small changes in the data or parameters shift performance sharply.
- Overfitting -- the configuration may memorize training data in some folds but fail on others.
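A common practical heuristic is the one-standard-error rule: rather than choosing the configuration with the absolute best mean score, choose the simplest configuration whose mean score is within one standard error of the best. Because std_test_score is a standard deviation across folds, dividing it by sqrt(K) gives an estimate of the standard error. The rule can be implemented via a custom refit callable, which receives cv_results_ and returns the index of the chosen candidate.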
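A possible refit callable for this heuristic is sketched below. The function name one_standard_error_index, the N_SPLITS constant, and the choice of "smaller C means simpler" are assumptions for illustration; what counts as the simplest model depends on the estimator. The sketch also assumes no candidate failed to fit.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

N_SPLITS = 5  # must match the cv argument passed to the search


def one_standard_error_index(cv_results):
    """Return the index of the simplest candidate within one SE of the best."""
    mean = np.asarray(cv_results["mean_test_score"], dtype=float)
    std = np.asarray(cv_results["std_test_score"], dtype=float)
    best = int(np.argmax(mean))
    # Standard error of the best candidate's mean score across folds.
    threshold = mean[best] - std[best] / np.sqrt(N_SPLITS)
    eligible = np.flatnonzero(mean >= threshold)
    # Assumption: a smaller C (stronger regularization) is the simpler model.
    c_values = np.array([p["C"] for p in cv_results["params"]], dtype=float)
    return int(eligible[np.argmin(c_values[eligible])])


X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    SVC(kernel="linear"),
    {"C": [0.01, 0.1, 1, 10, 100]},  # hypothetical grid
    cv=N_SPLITS,
    refit=one_standard_error_index,
)
search.fit(X, y)
print(search.best_index_, search.best_params_)
```

When refit is a callable, best_index_, best_params_, and best_estimator_ reflect the callable's choice, so the selected score should be read from cv_results_ at best_index_.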
Masked Arrays for Heterogeneous Grids
When using multiple sub-grids (e.g., [{'kernel': ['linear']}, {'kernel': ['rbf'], 'gamma': [0.1, 1]}]), not all parameters apply to all candidates. scikit-learn uses numpy masked arrays to represent parameter values, where entries are masked for candidates to which a parameter does not apply. This allows the full results table to have a consistent structure even with heterogeneous parameter spaces.
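The masking can be observed with the sub-grid example from the paragraph above. The dataset and cv=3 are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two sub-grids: gamma applies only to the rbf candidates.
param_grid = [{"kernel": ["linear"]}, {"kernel": ["rbf"], "gamma": [0.1, 1]}]
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

gamma = search.cv_results_["param_gamma"]
print(type(gamma).__name__)  # MaskedArray

# True marks candidates for which gamma was not part of the configuration.
gamma_mask = np.ma.getmaskarray(gamma)
print(gamma_mask)

# The params list holds only the keys that actually apply to each candidate.
print(search.cv_results_["params"])
```

When the results are converted to a pandas DataFrame, masked entries show up as missing values, so the column can be filtered with the usual notna() or isna() methods.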