Principle:Rapidsai Cuml Classification Evaluation

Knowledge Sources	Sokolova and Lapalme 2009 - A systematic analysis of performance measures for classification tasks; Manning et al. 2008 - Introduction to Information Retrieval; cuML Docs
Domains	Machine_Learning, Classification, Evaluation
Last Updated	2026-02-08 12:00 GMT

Overview

Classification evaluation is the quantitative assessment of classifier performance using metrics such as accuracy, log loss, confusion matrices, and hinge loss to measure how well a model assigns categorical labels to data points.

Description

Classification models assign discrete labels to input data, and evaluating their quality requires metrics that capture different aspects of prediction correctness. Unlike regression evaluation, where errors are continuous, classification evaluation counts correct and incorrect predictions and analyzes how the errors are distributed across classes.

Accuracy: The simplest classification metric, defined as the fraction of predictions that exactly match the true label. Accuracy is intuitive and easy to communicate but can be misleading for imbalanced datasets. If 95% of samples belong to class A, a trivial classifier that always predicts A achieves 95% accuracy despite never detecting class B.
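The imbalance problem described above can be reproduced in a few lines. This is a minimal CPU sketch in plain Python, not cuML's implementation; the helper names `accuracy` and `recall_for` and the 95/5 dataset are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def recall_for(label, y_true, y_pred):
    """Fraction of samples truly labeled `label` that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    # Returning 0.0 for an absent class is a convention chosen for this sketch.
    return hits / total if total else 0.0


# Hypothetical dataset: 95 samples of class "A", 5 of class "B".
y_true = ["A"] * 95 + ["B"] * 5
# Trivial classifier that always predicts the majority class.
y_pred = ["A"] * 100

acc = accuracy(y_true, y_pred)
recall_b = recall_for("B", y_true, y_pred)
print(f"accuracy={acc:.2f}, recall(B)={recall_b:.2f}")
```

The accuracy looks strong while the recall for class B is zero, which is why per-class metrics are worth checking whenever classes are imbalanced.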

Log Loss (Cross-Entropy Loss): Measures the quality of probabilistic predictions by penalizing confident wrong predictions more severely than uncertain ones. Log loss requires the model to output class probabilities rather than hard labels. It is the standard loss function for training logistic regression and neural network classifiers, and it serves as an evaluation metric that rewards well-calibrated probability estimates.
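To make the penalty structure concrete, the sketch below evaluates the per-sample binary cross-entropy for three predictions on a sample whose true label is 1. The function name `sample_log_loss` and the probabilities are chosen for illustration only.

```python
import math


def sample_log_loss(y, p):
    """Binary cross-entropy for one sample: y in {0, 1}, p = predicted P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# The true label is 1 in all three cases; only the predicted probability differs.
confident_right = sample_log_loss(1, 0.99)
uncertain = sample_log_loss(1, 0.50)
confident_wrong = sample_log_loss(1, 0.01)

print(confident_right, uncertain, confident_wrong)
```

The loss grows without bound as the probability assigned to the true class approaches zero, so a single confidently wrong prediction can dominate the average.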

Confusion Matrix: A table that breaks down predictions into true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for each class. The confusion matrix is the foundation from which many derived metrics are computed: precision (TP / (TP + FP)), recall (TP / (TP + FN)), F1-score (harmonic mean of precision and recall), and specificity (TN / (TN + FP)). For multiclass problems, the confusion matrix is a k-by-k grid where entry (i, j) counts samples with true label i that were predicted as label j.
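The derived metrics follow directly from the four counts. The sketch below is a plain-Python illustration; the function name `binary_metrics`, the example counts, and the choice to return 0.0 when a denominator is zero are assumptions of this sketch.

```python
def binary_metrics(tp, fp, tn, fn):
    """Precision, recall, specificity, and F1 from binary confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for a 100-sample evaluation set.
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```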

Hinge Loss: The loss function associated with maximum-margin classifiers such as Support Vector Machines. Hinge loss is zero when a sample is classified correctly with a margin of at least 1, and it grows linearly as the margin shrinks below 1, so samples that are correct but lie inside the margin are also penalized. It encourages not just correct classification but confident correct classification. As an evaluation metric, the average hinge loss indicates how well the model separates classes with margin.

Usage

Classification evaluation metrics are used when:

Selecting between competing classifiers on the same dataset.
Tuning hyperparameters (e.g., regularization strength, learning rate) to optimize a specific metric.
Diagnosing model weaknesses by examining the confusion matrix to identify which classes are frequently confused.
Evaluating probabilistic calibration: use log loss when well-calibrated probabilities are important (e.g., risk scoring).
Working with imbalanced classes: accuracy is insufficient; examine per-class precision, recall, and the full confusion matrix.
Evaluating margin-based classifiers: hinge loss directly measures margin quality.

Theoretical Basis

Accuracy:

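$Accuracy = \frac{1}{n} \sum_{i = 1}^{n} \mathbf{1} [{\hat{y}}_{i} = y_{i}]$

where $n$ is the number of samples, $y_{i}$ is the true label, ${\hat{y}}_{i}$ is the predicted label, and $\mathbf{1} [\cdot]$ is the indicator function (1 when its argument is true, 0 otherwise).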

Log Loss (Binary):

$LogLoss = - \frac{1}{n} \sum_{i = 1}^{n} [y_{i} \log ({\hat{p}}_{i}) + (1 - y_{i}) \log (1 - {\hat{p}}_{i})]$

Log Loss (Multiclass):

$LogLoss = - \frac{1}{n} \sum_{i = 1}^{n} \sum_{c = 1}^{C} y_{i c} \log ({\hat{p}}_{i c})$

where $y_{i c}$ is 1 if sample i belongs to class c and 0 otherwise, and ${\hat{p}}_{i c}$ is the predicted probability for class c.
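Because $y_{i c}$ is one-hot, the inner sum keeps only the log-probability of the true class. The sketch below uses that simplification; it assumes zero-based integer class indices and a small clipping constant `eps`, both choices of this example and not taken from the formula.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to each sample's true class.

    y_true: zero-based class indices, one per sample.
    probs:  one row of class probabilities per sample.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip so that a probability of exactly 0 does not produce log(0).
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)


# Hypothetical 3-class example.
y_true = [0, 2, 1]
probs = [
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
]
loss = multiclass_log_loss(y_true, probs)
print(f"multiclass log loss: {loss:.4f}")
```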

Confusion Matrix:

For binary classification:
                    Predicted Positive    Predicted Negative
Actual Positive         TP                    FN
Actual Negative         FP                    TN

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1        = 2 * Precision * Recall / (Precision + Recall)

For multiclass with k classes:
    CM[i][j] = count of samples with true label i, predicted label j
    CM is a k x k matrix; diagonal entries are correct predictions
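A sequential reference version of the k x k construction, with per-class precision and recall read off the column and row sums, might look like the following. The function names and the sample labels are illustrative; labels are assumed to be integers in the range 0 to k - 1.

```python
def confusion_matrix(y_true, y_pred, k):
    """cm[i][j] counts samples with true label i that were predicted as label j."""
    cm = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_precision_recall(cm):
    """Return a list of (precision, recall) tuples, one per class."""
    k = len(cm)
    stats = []
    for c in range(k):
        tp = cm[c][c]
        predicted_c = sum(cm[i][c] for i in range(k))  # column sum
        actual_c = sum(cm[c])                          # row sum
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        stats.append((precision, recall))
    return stats


# Hypothetical 3-class predictions.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, k=3)
stats = per_class_precision_recall(cm)
# Accuracy is the sum of the diagonal divided by the number of samples.
acc = sum(cm[c][c] for c in range(3)) / len(y_true)

for row in cm:
    print(row)
print(stats, acc)
```

Reading the off-diagonal entries shows which pairs of classes are confused: here one class-0 sample was predicted as class 1 and one class-2 sample as class 0.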

Hinge Loss:

$HingeLoss = \frac{1}{n} \sum_{i = 1}^{n} \max (0, 1 - y_{i} \cdot \hat{f} (x_{i}))$

where $y_{i} \in \{-1, +1\}$ and $\hat{f} (x_{i})$ is the raw decision function output (not a probability).
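The formula can be checked on a small example. The sketch below is a plain-Python reference, not cuML's implementation; the helper `to_signed`, which maps 0/1 labels to the signed convention the formula requires, is an assumption added for illustration.

```python
def to_signed(labels01):
    """Map labels in {0, 1} to the {-1, +1} convention used by hinge loss."""
    return [2 * y - 1 for y in labels01]


def hinge_loss(y_true, scores):
    """Mean hinge loss for labels in {-1, +1} and raw decision scores."""
    if any(y not in (-1, 1) for y in y_true):
        raise ValueError("labels must be -1 or +1")
    losses = [max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores)]
    return sum(losses) / len(losses)


# Hypothetical labels and decision-function outputs.
labels = [1, 1, 0, 0]
scores = [2.0, 0.5, -3.0, 0.25]

y_signed = to_signed(labels)
loss = hinge_loss(y_signed, scores)
print(y_signed, loss)
```

The second sample is classified correctly (positive score for a positive label) but still contributes 0.5 because its margin is below 1; the fourth sample is on the wrong side and contributes 1.25.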

GPU Computation:

Accuracy:
    matches = (y_pred == y_true)            (element-wise comparison, GPU parallel)
    accuracy = sum(matches) / n             (GPU reduction)

Log Loss (binary case; y_true in {0, 1}, y_prob = predicted probability of class 1):
    eps = 1e-15                             (clip to avoid log(0))
    p_clipped = clip(y_prob, eps, 1 - eps)
    losses = -y_true * log(p_clipped) - (1 - y_true) * log(1 - p_clipped)
    log_loss = mean(losses)                 (GPU reduction)

Confusion Matrix:
    For each sample i:
        CM[y_true[i]][y_pred[i]] += 1       (GPU atomic increment)

Hinge Loss (y_true in {-1, +1}, y_score = raw decision function output):
    margins = y_true * y_score              (element-wise, GPU parallel)
    losses = max(0, 1 - margins)            (element-wise, GPU parallel)
    hinge_loss = mean(losses)               (GPU reduction)
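The clipping step in the log loss pseudocode matters when a model outputs probabilities of exactly 0 or 1. The following sequential sketch mirrors that pseudocode on the CPU in plain Python; it is an illustration of the arithmetic only and says nothing about how the GPU kernels are organized.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy with probabilities clipped to [eps, 1 - eps]."""
    losses = []
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        losses.append(-y * math.log(p) - (1 - y) * math.log(1 - p))
    return sum(losses) / len(losses)


# Hypothetical hard 0/1 probabilities; the last prediction is confidently wrong.
y_true = [1, 0, 1]
y_prob = [1.0, 0.0, 0.0]

loss = binary_log_loss(y_true, y_prob)
print(f"clipped log loss: {loss:.4f}")
```

Without clipping, the third sample would require log(0) and the result would be undefined; with eps = 1e-15 its contribution is capped at about 34.5.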

Related Pages

Implemented By
