Implementation:Scikit learn Scikit learn Score Distribution Pattern

Metadata

Domains: Statistics, Model_Evaluation
Type: Pattern Doc (user code pattern, not a dedicated library API)
Last Updated: 2026-02-08 15:00 GMT

Overview

User code pattern for computing summary statistics from cross-validation score arrays. Unlike the other implementations in this workflow, score distribution analysis is not a single scikit-learn function but rather a common user code pattern that uses NumPy operations on the arrays returned by cross_val_score or cross_validate.

Pattern: Basic Summary Statistics

From cross_val_score

cross_val_score returns a simple NumPy array of shape (n_splits,). Computing summary statistics is straightforward:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')

# Central tendency
mean_score = np.mean(scores)

# Variability
std_score = np.std(scores)

# Report
print(f"Accuracy: {mean_score:.3f} +/- {std_score:.3f}")
# e.g., Accuracy: 0.960 +/- 0.022
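
The report above uses np.std with its default ddof=0. As a minimal sketch of how much that choice matters with few folds, the snippet below compares both settings on a hand-written score array (the values are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical fold scores, for illustration only (not real results)
scores = np.array([0.90, 0.95, 1.00, 0.95, 0.90])
n = len(scores)

pop_std = np.std(scores)             # ddof=0: divides by n
sample_std = np.std(scores, ddof=1)  # ddof=1: divides by n - 1

# The two estimates differ by a factor of sqrt(n / (n - 1))
ratio = sample_std / pop_std
print(f"ddof=0: {pop_std:.4f}  ddof=1: {sample_std:.4f}  ratio: {ratio:.4f}")
```

With 5 folds the sample estimate is about 12% larger; the gap shrinks as the number of folds grows.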

From cross_validate (multi-metric)

cross_validate returns a dictionary. Each metric's scores must be extracted by key:

import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

cv_results = cross_validate(
    clf, X, y, cv=5,
    scoring=['accuracy', 'f1_macro'],
    return_train_score=True
)

# Extract and summarize each metric
for metric in ['accuracy', 'f1_macro']:
    test_scores = cv_results[f'test_{metric}']
    train_scores = cv_results[f'train_{metric}']
    print(f"{metric}:")
    print(f"  Test:  {np.mean(test_scores):.3f} +/- {np.std(test_scores):.3f}")
    print(f"  Train: {np.mean(train_scores):.3f} +/- {np.std(train_scores):.3f}")

# Timing information
print(f"Fit time:   {np.mean(cv_results['fit_time']):.3f}s +/- {np.std(cv_results['fit_time']):.3f}s")
print(f"Score time: {np.mean(cv_results['score_time']):.3f}s +/- {np.std(cv_results['score_time']):.3f}s")
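
When many metrics are requested, the per-key extraction above can be wrapped in a small helper. The sketch below is one possible shape, not a scikit-learn API: summarize_cv_results is a hypothetical name, and the input dictionary is a hand-written stand-in for a cross_validate result so the example runs without fitting a model.

```python
import numpy as np

def summarize_cv_results(cv_results):
    """Return {key: (mean, std)} for the score and timing arrays in the dict."""
    return {
        key: (float(np.mean(values)), float(np.std(values)))
        for key, values in cv_results.items()
        # Keep only numeric arrays; skips entries such as 'estimator'
        if key.startswith(("test_", "train_")) or key in ("fit_time", "score_time")
    }

# Hand-written stand-in for a cross_validate result (illustrative values)
fake_results = {
    "fit_time": np.array([0.10, 0.12, 0.11]),
    "score_time": np.array([0.01, 0.01, 0.02]),
    "test_accuracy": np.array([0.90, 0.95, 1.00]),
    "train_accuracy": np.array([1.00, 1.00, 1.00]),
}

summary = summarize_cv_results(fake_results)
for key, (mean, std) in summary.items():
    print(f"{key}: {mean:.3f} +/- {std:.3f}")
```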

Pattern: Confidence Intervals

The following computes an approximate 95% confidence interval for the mean score, assuming the fold scores are approximately normally distributed:

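import numpy as np
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def confidence_interval(scores, confidence=0.95):
    """Compute a confidence interval for the mean of cross-validation scores."""
    n = len(scores)
    mean = np.mean(scores)
    # Standard error of the mean, using the sample standard deviation (ddof=1)
    std_err = np.std(scores, ddof=1) / np.sqrt(n)

    # Two-sided t critical value; larger than the normal value when n is small
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)

    margin = t_crit * std_err
    return mean, mean - margin, mean + margin

# Usage
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
mean, ci_low, ci_high = confidence_interval(scores)
print(f"Accuracy: {mean:.3f} (95% CI: [{ci_low:.3f}, {ci_high:.3f}])")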

For a quick approximation without SciPy, use the normal critical value z = 1.96 for a 95% CI. With only a few folds this yields a narrower interval than the t-based version:

import numpy as np
from sklearn.model_selection import cross_val_score

# clf, X, y as defined in the earlier examples
scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
mean = np.mean(scores)
std_err = np.std(scores, ddof=1) / np.sqrt(len(scores))

ci_low = mean - 1.96 * std_err
ci_high = mean + 1.96 * std_err
print(f"Accuracy: {mean:.3f} (95% CI: [{ci_low:.3f}, {ci_high:.3f}])")
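
To get a feel for how much the normal approximation narrows the interval, the sketch below compares the two margins for 10 folds. The scores are hypothetical, and 2.262 is the tabulated two-sided 95% t critical value for 9 degrees of freedom, hardcoded here so the example needs only NumPy.

```python
import numpy as np

# Hypothetical scores from 10 folds, for illustration only
scores = np.array([0.93, 1.00, 0.93, 0.87, 1.00, 0.93, 0.93, 1.00, 0.93, 1.00])
std_err = np.std(scores, ddof=1) / np.sqrt(len(scores))

Z_95 = 1.96        # normal critical value for a two-sided 95% interval
T_95_DF9 = 2.262   # t critical value for df = 9 (table value, assumed here)

margin_z = Z_95 * std_err
margin_t = T_95_DF9 * std_err
print(f"z margin: {margin_z:.4f}  t margin: {margin_t:.4f}")
print(f"t-based interval is {margin_t / margin_z:.2f}x wider")
```

The ratio depends only on the number of folds, not on the scores themselves, so the normal shortcut becomes more reasonable as the fold count increases.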

Pattern: Comparing Two Models

To compare two models, evaluate both on the same CV splits and examine the per-fold score differences:

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Use the same CV splits for a fair comparison
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

rf_scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)
gb_scores = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=cv)

print(f"Random Forest:       {np.mean(rf_scores):.3f} +/- {np.std(rf_scores):.3f}")
print(f"Gradient Boosting:   {np.mean(gb_scores):.3f} +/- {np.std(gb_scores):.3f}")

# Paired difference across folds
diff = rf_scores - gb_scores
print(f"Mean difference:     {np.mean(diff):.3f} +/- {np.std(diff):.3f}")

Key Considerations

Sample standard deviation: Use ddof=1 (Bessel's correction) when computing confidence intervals. The default np.std() uses ddof=0 (population standard deviation), which slightly underestimates variability when the number of folds is small.
Fold score correlation: Because training sets overlap across folds, the fold scores are not truly independent. Confidence intervals based on the independence assumption tend to be somewhat optimistic (too narrow). Be cautious when interpreting narrow intervals with small datasets.
Paired comparisons: When comparing two models, compute the per-fold score differences and analyze those differences. Using the same CV splits for both models controls for fold-level variability.
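
The paired analysis described above can be sketched as follows. The helper name paired_difference_summary and the score arrays are hypothetical, introduced only for illustration. The statistic computed is the usual paired t statistic (the same quantity scipy.stats.ttest_rel reports), but because fold scores are correlated it tends to overstate significance, so treat it as a rough guide rather than a formal test.

```python
import numpy as np

def paired_difference_summary(scores_a, scores_b):
    """Mean per-fold difference, its standard error, and the paired t statistic."""
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    mean_diff = diff.mean()
    # Sample standard deviation (ddof=1) of the differences
    std_err = diff.std(ddof=1) / np.sqrt(len(diff))
    # Note: undefined (division by zero) if all differences are identical
    t_stat = mean_diff / std_err
    return mean_diff, std_err, t_stat

# Hypothetical per-fold scores for two models on the same splits
model_a = np.array([0.95, 0.93, 0.97, 0.94, 0.96])
model_b = np.array([0.93, 0.92, 0.94, 0.94, 0.93])

mean_diff, std_err, t_stat = paired_difference_summary(model_a, model_b)
print(f"Mean difference: {mean_diff:.3f} (SE {std_err:.4f}), t = {t_stat:.2f}")
```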

Related Pages

Principle:Scikit_learn_Scikit_learn_Score_Distribution_Analysis
