Principle:Online ml River Multi Output Metrics

Knowledge Sources	Domains	Last Updated
Machine Learning Multi-Label Classification	Online_Learning, Evaluation, Multi_Label_Learning	2026-02-08 18:00 GMT

Overview

Multi-output evaluation metrics assess models that predict multiple target variables simultaneously, such as multi-label classifiers or multi-target regressors. These metrics aggregate per-output performance using strategies like macro-averaging, micro-averaging, sample-averaging, or per-output reporting, each providing a different perspective on model quality.

Description

Many real-world prediction tasks involve multiple outputs: a document may have several topic labels (multi-label classification), or a system may predict multiple target values simultaneously (multi-target regression). Evaluating such models requires metrics that can handle sets or vectors of predictions rather than single values.

Key multi-output evaluation strategies include:

Exact Match: The strictest metric for multi-label classification. It considers a prediction correct only if the entire predicted label set exactly matches the true label set. This is a harsh measure because getting even one label wrong results in zero credit for that instance.

Macro-averaging: Computes the metric independently for each output/label and then takes the unweighted mean. This gives equal importance to each label regardless of its frequency, making it sensitive to performance on rare labels.

Micro-averaging: Aggregates the contributions of all labels into a single pool of TP, FP, FN counts, then computes the metric on the aggregated counts. Because every individual label assignment counts equally, frequent labels dominate the result and rare labels have little influence on it.

Sample-averaging: Computes the metric for each instance (across all labels) and then averages over instances. This reflects how well the model does on a typical instance.

Per-output reporting: Reports the metric value for each output separately, without aggregation. This is useful for identifying which outputs are well-predicted and which need improvement.

Multi-label confusion matrix: Extends the binary confusion matrix to the multi-label setting by maintaining per-label TP, FP, TN, FN counts.

Rand Index: Measures the agreement between predicted and true label assignments by examining every pair of instances and counting the pairs that are treated consistently: given the same label in both truth and prediction, or given different labels in both.
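Per-output reporting can be sketched for the multi-target regression case as follows. This is a minimal illustration in plain Python, not a library API: the class name `PerOutputMAE`, its methods, and the output names `temp` and `humidity` are all hypothetical, and mean absolute error is chosen only as an example metric.

```python
from collections import defaultdict


class PerOutputMAE:
    """Hypothetical helper: running mean absolute error, tracked per output."""

    def __init__(self):
        self.abs_error = defaultdict(float)  # summed |error| per output
        self.n = defaultdict(int)            # number of observations per output

    def update(self, y_true, y_pred):
        # y_true and y_pred are dicts mapping output name -> numeric value.
        for name, target in y_true.items():
            self.abs_error[name] += abs(target - y_pred[name])
            self.n[name] += 1

    def get(self):
        # One value per output, with no aggregation.
        return {name: self.abs_error[name] / self.n[name] for name in self.n}

    def macro(self):
        # Unweighted mean over outputs, for comparison with the per-output view.
        per_output = self.get()
        return sum(per_output.values()) / len(per_output)


m = PerOutputMAE()
m.update({"temp": 20.0, "humidity": 0.5}, {"temp": 22.0, "humidity": 0.5})
m.update({"temp": 18.0, "humidity": 0.25}, {"temp": 17.0, "humidity": 0.75})
report = m.get()
```

Here `report` keeps the two outputs apart, which shows that `temp` is the poorly predicted one; the single number from `m.macro()` would hide that.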

Usage

Use multi-output metrics when:

Your model produces multiple predictions per instance (multi-label or multi-target).
You need to understand model performance at the label level, instance level, or globally.
You want to compare different aggregation strategies to understand trade-offs.
You are working with streaming multi-label data and need incremental evaluation.

Theoretical Basis

Aggregation Strategies

Macro-average:
    M_macro = (1/L) * sum_{l=1}^{L} M_l
    where M_l is the metric computed only on label l

Micro-average:
    M_micro = M(sum_l TP_l, sum_l FP_l, sum_l FN_l)
    where TP_l, FP_l, FN_l are counts for label l

Sample-average:
    M_sample = (1/N) * sum_{i=1}^{N} M(y_i, \hat{y}_i)
    where M(y_i, \hat{y}_i) is computed across all labels for instance i
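The three aggregation formulas can be compared on a small batch. The sketch below uses F1 as the base metric M and assumes label sets are given as Python sets; the function names and the convention of returning 0.0 for an undefined score are assumptions made for illustration.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when undefined (an assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred, labels):
    """Return (macro, micro, sample) F1 for lists of true and predicted label sets."""
    tp, fp, fn = Counter(), Counter(), Counter()
    sample_scores = []
    for truth, pred in zip(y_true, y_pred):
        for label in labels:
            if label in truth and label in pred:
                tp[label] += 1
            elif label in pred:
                fp[label] += 1
            elif label in truth:
                fn[label] += 1
        # Sample-average: score this instance across all of its labels.
        sample_scores.append(
            f1(len(truth & pred), len(pred - truth), len(truth - pred))
        )
    # Macro: metric per label, then unweighted mean over labels.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over labels, then compute the metric once.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    sample = sum(sample_scores) / len(sample_scores)
    return macro, micro, sample


y_true = [{"a", "b"}, {"a"}, {"a", "c"}]
y_pred = [{"a"}, {"a"}, {"a", "b"}]
macro, micro, sample = averaged_f1(y_true, y_pred, labels=["a", "b", "c"])
```

On this toy data the frequent label `a` is always predicted correctly while the rare labels `b` and `c` are never right, so the macro score (1/3) is much lower than the micro score (2/3).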

Exact Match

ExactMatch = (1/N) * sum_{i=1}^{N} I(y_i == \hat{y}_i)
    where I() is the indicator function and y_i, \hat{y}_i are full label sets
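A minimal sketch of the exact-match formula, assuming each label set is a Python set (the function name is illustrative):

```python
def exact_match(y_true, y_pred):
    """Fraction of instances whose predicted label set equals the true label set."""
    if not y_true:
        return 0.0  # assumed convention for an empty batch
    hits = sum(1 for truth, pred in zip(y_true, y_pred) if set(truth) == set(pred))
    return hits / len(y_true)


y_true = [{"a", "b"}, {"a"}, {"a", "c"}]
y_pred = [{"a"}, {"a"}, {"a", "b"}]
score = exact_match(y_true, y_pred)
```

Only the second instance matches exactly, so the score is 1/3; the first instance earns no credit even though one of its two labels was predicted correctly.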

Rand Index

For all pairs of instances (i, j) with i < j:
    a = number of pairs that share the same label in both truth and prediction
    b = number of pairs that have different labels in both truth and prediction
    RandIndex = (a + b) / C(N, 2)
    where C(N, 2) = N(N-1)/2 is the total number of pairs
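The pair-counting definition can be written directly as a batch computation. This sketch assumes each instance carries a single hashable label (or a hashable label set such as a frozenset) and enumerates all pairs, which costs O(N^2); it is meant to illustrate the formula rather than to be used on a stream.

```python
from itertools import combinations


def rand_index(y_true, y_pred):
    """Share of instance pairs on which truth and prediction agree."""
    n = len(y_true)
    if n < 2:
        raise ValueError("need at least two instances to form a pair")
    agreements = 0
    for i, j in combinations(range(n), 2):
        same_true = y_true[i] == y_true[j]
        same_pred = y_pred[i] == y_pred[j]
        # Counts both a (same/same) and b (different/different).
        if same_true == same_pred:
            agreements += 1
    return agreements / (n * (n - 1) // 2)


y_true = [0, 0, 1, 1]
y_pred = [1, 1, 0, 2]
ri = rand_index(y_true, y_pred)
```

The label names themselves do not matter, only which instances are grouped together: five of the six pairs agree here, and the one disagreement is the pair (2, 3), which shares a true label but is split by the prediction.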

Streaming Updates

In the online setting, all these strategies maintain their underlying counts incrementally:

For each multi-label instance (x, y_true, y_pred):
    For each label l:
        if l in y_true and l in y_pred: TP_l += 1
        elif l in y_pred:               FP_l += 1
        elif l in y_true:               FN_l += 1
        else:                           TN_l += 1
    Recompute desired aggregated metric from updated counts
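The update loop above can be sketched as a small Python class. It assumes each instance is represented as a dict mapping label to a boolean; the class name `StreamingLabelCounts` and its methods are hypothetical and use precision as the example metric.

```python
from collections import defaultdict


class StreamingLabelCounts:
    """Hypothetical helper: per-label confusion counts updated one instance at a time."""

    def __init__(self):
        self.counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0, "TN": 0})

    def update(self, y_true, y_pred):
        # A label missing from either dict is treated as False.
        for label in y_true.keys() | y_pred.keys():
            t = y_true.get(label, False)
            p = y_pred.get(label, False)
            if t and p:
                key = "TP"
            elif p:
                key = "FP"
            elif t:
                key = "FN"
            else:
                key = "TN"
            self.counts[label][key] += 1

    def micro_precision(self):
        # Pool counts over labels, then compute the metric once.
        tp = sum(c["TP"] for c in self.counts.values())
        fp = sum(c["FP"] for c in self.counts.values())
        return tp / (tp + fp) if tp + fp else 0.0

    def macro_precision(self):
        # Compute the metric per label, then take the unweighted mean.
        scores = [
            c["TP"] / (c["TP"] + c["FP"]) if c["TP"] + c["FP"] else 0.0
            for c in self.counts.values()
        ]
        return sum(scores) / len(scores) if scores else 0.0


stream = [
    ({"a": True, "b": False}, {"a": True, "b": True}),
    ({"a": True, "b": False}, {"a": True, "b": False}),
    ({"a": False, "b": True}, {"a": True, "b": True}),
]
tracker = StreamingLabelCounts()
for y_true, y_pred in stream:
    tracker.update(y_true, y_pred)
```

Because only the four counts per label are stored, memory grows with the number of labels rather than with the number of instances seen, and any count-based metric can be read off at any point in the stream.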
