Principle: Scikit-learn Metric Evaluation
| Field | Value |
|---|---|
| sources | Sokolova, M. and Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45(4), 427-437; scikit-learn documentation: https://scikit-learn.org/stable/modules/model_evaluation.html |
| domains | Machine_Learning, Statistics, Data_Science |
| last_updated | 2026-02-08 15:00 GMT |
Overview
A quantitative assessment framework that measures classifier performance against known ground truth.
Description
Metric evaluation is the process of computing numerical scores that summarize how well a classifier's predictions match the true labels. These metrics serve as the objective criteria for model selection, hyperparameter tuning, and reporting final results.
The most commonly used classification metrics include:
- Accuracy -- The fraction of predictions that are correct. Simple and intuitive but can be misleading on imbalanced datasets where a naive majority-class classifier achieves high accuracy.
- Precision -- Of all samples predicted as a given class, the fraction that truly belong to that class. High precision means few false positives.
- Recall (Sensitivity) -- Of all samples that truly belong to a given class, the fraction that were correctly identified. High recall means few false negatives.
- F1 Score -- The harmonic mean of precision and recall, providing a single number that balances both concerns. Defined as F1 = 2 * (precision * recall) / (precision + recall).
- Confusion Matrix -- A table of shape (n_classes, n_classes) where entry (i, j) counts the number of samples known to be in class i but predicted as class j. The diagonal entries represent correct predictions.
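The metrics above map directly onto functions in `sklearn.metrics`. A minimal sketch on small illustrative label arrays (not from any real model):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Illustrative ground-truth labels and predictions.
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2, 2]

acc = accuracy_score(y_true, y_pred)                     # fraction of correct predictions
prec = precision_score(y_true, y_pred, average="macro")  # per-class precision, then mean
rec = recall_score(y_true, y_pred, average="macro")      # per-class recall, then mean
f1 = f1_score(y_true, y_pred, average="macro")           # per-class harmonic mean, then mean
cm = confusion_matrix(y_true, y_pred)                    # shape (n_classes, n_classes)
```

The `average` parameter controls how per-class scores are aggregated in multiclass problems; the averaging strategies are discussed in the Theoretical Basis section.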
Usage
Use metric evaluation when:
- Assessing model quality -- After training and prediction, compute metrics on the held-out test set to estimate generalization performance.
- Comparing models -- Use consistent metrics to compare different algorithms or hyperparameter settings on the same test data.
- Diagnosing errors -- The confusion matrix reveals which classes are being confused with each other, guiding model improvement.
- Reporting results -- Classification reports provide a per-class breakdown of precision, recall, and F1, which is essential for communicating model behavior to stakeholders.
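The per-class breakdown mentioned in the last point is available as a single call, `sklearn.metrics.classification_report`. A short sketch with illustrative labels:

```python
from sklearn.metrics import classification_report

# Illustrative labels; a real workflow would use held-out test data.
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2, 2]

# One row per class (precision, recall, F1, support), plus
# macro and weighted averages at the bottom.
report = classification_report(y_true, y_pred)
print(report)
```

Passing `output_dict=True` instead returns the same numbers as a nested dictionary, which is convenient for programmatic comparisons.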
Theoretical Basis
True/False Positives and Negatives
For a given class, each prediction falls into one of four categories:
| Predicted Positive | Predicted Negative | |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
From these counts, the core metrics are derived:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 = 2 * Precision * Recall / (Precision + Recall)
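The per-class metrics follow directly from the four counts. A plain-Python sketch using illustrative counts (not taken from any real classifier):

```python
# Illustrative counts for one class treated as "positive".
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions over all samples
precision = tp / (tp + fp)                  # high when few false positives
recall = tp / (tp + fn)                     # high when few false negatives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```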
Per-Class vs. Aggregated Metrics
In multiclass settings, precision, recall, and F1 are first computed for each class individually (treating that class as the "positive" class in a one-vs-rest fashion). These per-class scores are then aggregated into a single number using one of several averaging strategies:
- Macro averaging -- Compute the metric independently for each class and then take the unweighted mean. This gives equal importance to every class regardless of its frequency.
- Micro averaging -- Aggregate the TP, FP, and FN counts across all classes and then compute the metric from the aggregated counts. This is equivalent to accuracy for single-label classification.
- Weighted averaging -- Like macro, but each class's metric is weighted by its support (number of true instances). This accounts for class imbalance.
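The three strategies correspond to the `average` parameter of scikit-learn's scoring functions. A sketch contrasting them on an illustrative imbalanced label set:

```python
from sklearn.metrics import f1_score, accuracy_score

# Imbalanced illustrative labels: class 0 dominates.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2]

macro = f1_score(y_true, y_pred, average="macro")        # unweighted mean over classes
micro = f1_score(y_true, y_pred, average="micro")        # pooled TP/FP/FN counts
weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by class support

# For single-label classification, micro-averaged F1 equals accuracy.
acc = accuracy_score(y_true, y_pred)
```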
Confusion Matrix
The confusion matrix C of shape (n_classes, n_classes) is defined entry-wise as C[i, j] = the number of samples whose true class is i and whose predicted class is j.
A perfect classifier produces a diagonal confusion matrix. Off-diagonal entries indicate misclassifications and reveal systematic patterns of confusion between specific class pairs.
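The diagonal/off-diagonal structure can be checked concretely: since the diagonal collects correct predictions, the trace divided by the total count reproduces accuracy. A sketch on illustrative labels:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Illustrative labels.
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2, 2]

C = confusion_matrix(y_true, y_pred)

# C[0, 1] counts true-class-0 samples mispredicted as class 1;
# the diagonal holds correct predictions, so trace / total == accuracy.
diag_fraction = C.trace() / C.sum()
```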