
Principle:Scikit learn Metric Evaluation



Field         Value
sources       Sokolova, M. and Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45(4), 427-437; scikit-learn documentation: https://scikit-learn.org/stable/modules/model_evaluation.html
domains       Machine_Learning, Statistics, Data_Science
last_updated  2026-02-08 15:00 GMT

Overview

A quantitative assessment framework that measures classifier performance against known ground truth.

Description

Metric evaluation is the process of computing numerical scores that summarize how well a classifier's predictions match the true labels. These metrics serve as the objective criteria for model selection, hyperparameter tuning, and reporting final results.

The most commonly used classification metrics include:

  • Accuracy -- The fraction of predictions that are correct. Simple and intuitive but can be misleading on imbalanced datasets where a naive majority-class classifier achieves high accuracy.
  • Precision -- Of all samples predicted as a given class, the fraction that truly belong to that class. High precision means few false positives.
  • Recall (Sensitivity) -- Of all samples that truly belong to a given class, the fraction that were correctly identified. High recall means few false negatives.
  • F1 Score -- The harmonic mean of precision and recall, providing a single number that balances both concerns. Defined as F1 = 2 · (precision · recall) / (precision + recall).
  • Confusion Matrix -- A table of shape (n_classes, n_classes) where entry C_{i,j} counts the number of samples known to be in class i but predicted as class j. The diagonal entries represent correct predictions.
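As a minimal sketch, all of the metrics above can be computed with functions from scikit-learn's sklearn.metrics module; the label arrays here are hypothetical:

    from sklearn.metrics import (
        accuracy_score,
        confusion_matrix,
        f1_score,
        precision_score,
        recall_score,
    )

    # Hypothetical ground-truth and predicted labels for a 3-class problem.
    y_true = [0, 1, 2, 2, 1, 0, 2, 1]
    y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

    # Fraction of predictions that are correct.
    print(accuracy_score(y_true, y_pred))

    # Per-class scores aggregated with macro averaging; the averaging
    # strategies are discussed under "Per-Class vs. Aggregated Metrics".
    print(precision_score(y_true, y_pred, average="macro"))
    print(recall_score(y_true, y_pred, average="macro"))
    print(f1_score(y_true, y_pred, average="macro"))

    # Entry (i, j) counts samples of true class i predicted as class j.
    print(confusion_matrix(y_true, y_pred))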

Usage

Use metric evaluation when:

  • Assessing model quality -- After training and prediction, compute metrics on the held-out test set to estimate generalization performance.
  • Comparing models -- Use consistent metrics to compare different algorithms or hyperparameter settings on the same test data.
  • Diagnosing errors -- The confusion matrix reveals which classes are being confused with each other, guiding model improvement.
  • Reporting results -- Classification reports provide a per-class breakdown of precision, recall, and F1, which is essential for communicating model behavior to stakeholders.
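The per-class breakdown mentioned in the last point is available through scikit-learn's classification_report; a minimal sketch with hypothetical labels and class names:

    from sklearn.metrics import classification_report

    # Hypothetical labels for a binary task with classes "ham" and "spam".
    y_true = [0, 0, 1, 1, 0, 1, 0, 1]
    y_pred = [0, 1, 1, 1, 0, 0, 0, 1]

    # Text table of per-class precision, recall, F1, and support,
    # plus the macro and weighted averages.
    print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))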

Theoretical Basis

True/False Positives and Negatives

For a given class, each prediction falls into one of four categories:

                     Predicted Positive      Predicted Negative
Actually Positive    True Positive (TP)      False Negative (FN)
Actually Negative    False Positive (FP)     True Negative (TN)
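In the binary case the four counts can be unpacked directly from scikit-learn's confusion matrix; a small sketch with hypothetical labels:

    from sklearn.metrics import confusion_matrix

    # Hypothetical binary ground truth and predictions.
    y_true = [0, 1, 0, 1, 1, 0, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

    # For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]],
    # so ravel() unpacks the four counts in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(tp, fp, fn, tn)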

From these counts, the core metrics are derived:

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 = 2TP / (2TP + FP + FN)
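As a worked sketch, evaluating these formulas on hypothetical counts (TP = 40, FP = 10, FN = 20, TN = 30):

    # Hypothetical counts: TP = 40, FP = 10, FN = 20, TN = 30.
    tp, fp, fn, tn = 40, 10, 20, 30

    accuracy = (tp + tn) / (tp + tn + fp + fn)   # 70 / 100 = 0.70
    precision = tp / (tp + fp)                   # 40 / 50  = 0.80
    recall = tp / (tp + fn)                      # 40 / 60  ≈ 0.667
    f1 = 2 * tp / (2 * tp + fp + fn)             # 80 / 110 ≈ 0.727

    print(accuracy, precision, recall, f1)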

Per-Class vs. Aggregated Metrics

In multiclass settings, precision, recall, and F1 are first computed for each class individually (treating that class as the "positive" class in a one-vs-rest fashion). These per-class scores are then aggregated into a single number using one of several averaging strategies:

  • Macro averaging -- Compute the metric independently for each class and then take the unweighted mean. This gives equal importance to every class regardless of its frequency.
  • Micro averaging -- Aggregate the TP, FP, and FN counts across all classes and then compute the metric from the aggregated counts. This is equivalent to accuracy for single-label classification.
  • Weighted averaging -- Like macro, but each class's metric is weighted by its support (number of true instances). This accounts for class imbalance.
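The three strategies correspond to the average parameter of scikit-learn's scoring functions; a minimal sketch comparing them on hypothetical imbalanced labels:

    from sklearn.metrics import f1_score

    # Hypothetical imbalanced 3-class labels: class 0 dominates.
    y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]
    y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2]

    # Unweighted mean of per-class F1: every class counts equally.
    print(f1_score(y_true, y_pred, average="macro"))

    # Computed from global TP/FP/FN counts; equals accuracy here.
    print(f1_score(y_true, y_pred, average="micro"))

    # Per-class F1 weighted by support (number of true instances).
    print(f1_score(y_true, y_pred, average="weighted"))

On imbalanced data the macro score is pulled down by poorly predicted rare classes while the weighted score tracks the dominant class, which is why reporting more than one average is common practice.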

Confusion Matrix

The confusion matrix C of shape (K, K) for K classes is defined as:

C_{i,j} = |{ k : y_k^true = i and y_k^pred = j }|

A perfect classifier produces a diagonal confusion matrix. Off-diagonal entries indicate misclassifications and reveal systematic patterns of confusion between specific class pairs.
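A minimal sketch of this diagnostic use, with hypothetical labels for a three-class problem in which one confusion dominates:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Hypothetical labels where class 1 is often mistaken for class 2.
    y_true = [0, 1, 1, 1, 2, 2, 0, 1]
    y_pred = [0, 2, 1, 2, 2, 2, 0, 1]

    C = confusion_matrix(y_true, y_pred)
    print(C)

    # Zero out the diagonal to surface the most frequent misclassification.
    off_diag = C - np.diag(np.diag(C))
    i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
    print(f"most common confusion: true class {i} predicted as class {j}")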
