Principle: Scikit-learn Metric Evaluation
| Field | Value |
|---|---|
| sources | Sokolova, M. and Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45(4), 427-437; scikit-learn documentation: https://scikit-learn.org/stable/modules/model_evaluation.html |
| domains | Machine_Learning, Statistics, Data_Science |
| last_updated | 2026-02-08 15:00 GMT |
Overview
A quantitative assessment framework that measures classifier performance against known ground truth.
Description
Metric evaluation is the process of computing numerical scores that summarize how well a classifier's predictions match the true labels. These metrics serve as the objective criteria for model selection, hyperparameter tuning, and reporting final results.
The most commonly used classification metrics include:
- Accuracy -- The fraction of predictions that are correct. Simple and intuitive but can be misleading on imbalanced datasets where a naive majority-class classifier achieves high accuracy.
- Precision -- Of all samples predicted as a given class, the fraction that truly belong to that class. High precision means few false positives.
- Recall (Sensitivity) -- Of all samples that truly belong to a given class, the fraction that were correctly identified. High recall means few false negatives.
- F1 Score -- The harmonic mean of precision and recall, providing a single number that balances both concerns. Defined as F1 = 2 * (precision * recall) / (precision + recall).
- Confusion Matrix -- A table of shape (n_classes, n_classes) where entry (i, j) counts the number of samples known to be in class i but predicted as class j. The diagonal entries represent correct predictions.
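The metrics above map directly onto functions in `sklearn.metrics`. A minimal sketch on small illustrative label arrays (not from any real model):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Illustrative ground-truth labels and predictions.
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2, 2]

acc = accuracy_score(y_true, y_pred)                     # fraction of correct predictions
prec = precision_score(y_true, y_pred, average="macro")  # per-class precision, then mean
rec = recall_score(y_true, y_pred, average="macro")      # per-class recall, then mean
f1 = f1_score(y_true, y_pred, average="macro")           # per-class harmonic mean, then mean
cm = confusion_matrix(y_true, y_pred)                    # shape (n_classes, n_classes)
```

The `average` parameter controls how per-class scores are aggregated in multiclass problems; the averaging strategies are discussed in the Theoretical Basis section.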
Usage
Use metric evaluation when:
- Assessing model quality -- After training and prediction, compute metrics on the held-out test set to estimate generalization performance.
- Comparing models -- Use consistent metrics to compare different algorithms or hyperparameter settings on the same test data.
- Diagnosing errors -- The confusion matrix reveals which classes are being confused with each other, guiding model improvement.
- Reporting results -- Classification reports provide a per-class breakdown of precision, recall, and F1, which is essential for communicating model behavior to stakeholders.
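The per-class breakdown mentioned in the last point is available as a single call, `sklearn.metrics.classification_report`. A short sketch with illustrative labels:

```python
from sklearn.metrics import classification_report

# Illustrative labels; a real workflow would use held-out test data.
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2, 2]

# One row per class (precision, recall, F1, support), plus
# macro and weighted averages at the bottom.
report = classification_report(y_true, y_pred)
print(report)
```

Passing `output_dict=True` instead returns the same numbers as a nested dictionary, which is convenient for programmatic comparisons.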
Theoretical Basis
True/False Positives and Negatives
For a given class, each prediction falls into one of four categories:
| Predicted Positive | Predicted Negative | |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
From these counts, the core metrics are derived:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 = 2 * Precision * Recall / (Precision + Recall)
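The per-class metrics follow directly from the four counts. A plain-Python sketch using illustrative counts (not taken from any real classifier):

```python
# Illustrative counts for one class treated as "positive".
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions over all samples
precision = tp / (tp + fp)                  # high when few false positives
recall = tp / (tp + fn)                     # high when few false negatives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```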
Per-Class vs. Aggregated Metrics
In multiclass settings, precision, recall, and F1 are first computed for each class individually (treating that class as the "positive" class in a one-vs-rest fashion). These per-class scores are then aggregated into a single number using one of several averaging strategies:
- Macro averaging -- Compute the metric independently for each class and then take the unweighted mean. This gives equal importance to every class regardless of its frequency.
- Micro averaging -- Aggregate the TP, FP, and FN counts across all classes and then compute the metric from the aggregated counts. This is equivalent to accuracy for single-label classification.
- Weighted averaging -- Like macro, but each class's metric is weighted by its support (number of true instances). This accounts for class imbalance.
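The three strategies correspond to the `average` parameter of scikit-learn's scoring functions. A sketch contrasting them on an illustrative imbalanced label set:

```python
from sklearn.metrics import f1_score, accuracy_score

# Imbalanced illustrative labels: class 0 dominates.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2]

macro = f1_score(y_true, y_pred, average="macro")        # unweighted mean over classes
micro = f1_score(y_true, y_pred, average="micro")        # pooled TP/FP/FN counts
weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by class support

# For single-label classification, micro-averaged F1 equals accuracy.
acc = accuracy_score(y_true, y_pred)
```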
Confusion Matrix
The confusion matrix C of shape (n_classes, n_classes) is defined entry-wise as C[i, j] = the number of samples whose true class is i and whose predicted class is j.
A perfect classifier produces a diagonal confusion matrix. Off-diagonal entries indicate misclassifications and reveal systematic patterns of confusion between specific class pairs.
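The diagonal/off-diagonal structure can be checked concretely: since the diagonal collects correct predictions, the trace divided by the total count reproduces accuracy. A sketch on illustrative labels:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Illustrative labels.
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2, 2]

C = confusion_matrix(y_true, y_pred)

# C[0, 1] counts true-class-0 samples mispredicted as class 1;
# the diagonal holds correct predictions, so trace / total == accuracy.
diag_fraction = C.trace() / C.sum()
```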