Principle:Online ml River Multi Output Metrics
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| Machine Learning Multi-Label Classification | Online_Learning, Evaluation, Multi_Label_Learning | 2026-02-08 18:00 GMT |
Overview
Multi-output evaluation metrics assess models that predict multiple target variables simultaneously, such as multi-label classifiers or multi-target regressors. These metrics aggregate per-output performance using strategies like macro-averaging, micro-averaging, sample-averaging, or per-output reporting, each providing a different perspective on model quality.
Description
Many real-world prediction tasks involve multiple outputs: a document may have several topic labels (multi-label classification), or a system may predict multiple target values simultaneously (multi-target regression). Evaluating such models requires metrics that can handle sets or vectors of predictions rather than single values.
Key multi-output evaluation strategies include:
- Exact Match: The strictest metric for multi-label classification. It counts a prediction as correct only if the entire predicted label set exactly matches the true label set, so a single wrong label results in zero credit for that instance.
- Macro-averaging: Computes the metric independently for each output/label and then takes the unweighted mean. This gives equal importance to each label regardless of its frequency, making it sensitive to performance on rare labels.
- Micro-averaging: Pools the TP, FP, and FN counts of all labels, then computes the metric once on the pooled counts. This gives more weight to frequent labels and reflects overall performance across all individual label assignments.
- Sample-averaging: Computes the metric for each instance (across all labels) and then averages over instances. This reflects how well the model does on a typical instance.
- Per-output reporting: Reports the metric value for each output separately, without aggregation. This is useful for identifying which outputs are well-predicted and which need improvement.
- Multi-label confusion matrix: Extends the binary confusion matrix to the multi-label setting by maintaining per-label TP, FP, TN, FN counts.
- Rand Index: Measures the similarity between the predicted and true label assignments by comparing all pairs of instances. A pair counts as an agreement when truth and prediction both give the two instances the same assignment, or both give them different assignments.
Usage
Use multi-output metrics when:
- Your model produces multiple predictions per instance (multi-label or multi-target).
- You need to understand model performance at the label level, instance level, or globally.
- You want to compare different aggregation strategies to understand trade-offs.
- You are working with streaming multi-label data and need incremental evaluation.
Theoretical Basis
Aggregation Strategies
Macro-average:
M_macro = (1/L) * sum_{l=1}^{L} M_l
where M_l is the metric computed only on label l
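As an illustration of the macro-average formula, here is a minimal sketch that uses precision as the per-label metric M_l. The helper names (`precision`, `macro_average`) and the counts are hypothetical and chosen only for this example; they are not taken from River.

```python
def precision(tp: int, fp: int) -> float:
    """Precision from raw counts; defined as 0.0 when no positives were predicted."""
    total = tp + fp
    return tp / total if total else 0.0


def macro_average(per_label_scores: dict[str, float]) -> float:
    """Unweighted mean of the per-label scores M_l."""
    return sum(per_label_scores.values()) / len(per_label_scores)


# Hypothetical (TP, FP) counts: one frequent label and one rare label.
counts = {"sports": (90, 10), "politics": (1, 3)}

per_label = {label: precision(tp, fp) for label, (tp, fp) in counts.items()}
macro_precision = macro_average(per_label)

print(per_label)        # per-label precision values
print(macro_precision)  # each label contributes equally to this mean
```

Because each label carries the same weight, the weak score on the rare label pulls the macro-average down substantially even though that label accounts for very few predictions.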
Micro-average:
M_micro = M(sum_l TP_l, sum_l FP_l, sum_l FN_l)
where TP_l, FP_l, FN_l are counts for label l
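The micro-average formula can be sketched in the same spirit, again with precision as the metric M. The function names and the counts below are hypothetical; the point is only that the counts are pooled before the metric is computed.

```python
def precision(tp: int, fp: int) -> float:
    """Precision from raw counts; defined as 0.0 when no positives were predicted."""
    total = tp + fp
    return tp / total if total else 0.0


def micro_precision(counts: dict[str, tuple[int, int]]) -> float:
    """Pool TP and FP over all labels, then compute precision once."""
    total_tp = sum(tp for tp, _ in counts.values())
    total_fp = sum(fp for _, fp in counts.values())
    return precision(total_tp, total_fp)


# Hypothetical (TP, FP) counts: one frequent label and one rare label.
counts = {"sports": (90, 10), "politics": (1, 3)}

micro = micro_precision(counts)
print(micro)  # 91 / 104 = 0.875
```

The pooled result stays close to the precision of the frequent label, which is the weighting behaviour described for micro-averaging.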
Sample-average:
M_sample = (1/N) * sum_{i=1}^{N} M(y_i, hat{y}_i)
where M(y_i, hat{y}_i) is computed across all labels for instance i
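A small sketch of sample-averaging follows. It assumes the per-instance metric M is the Jaccard index between the true and predicted label sets; that choice, the convention for two empty sets, and the function names are assumptions made for illustration.

```python
def jaccard(y_true: set, y_pred: set) -> float:
    """Jaccard index of two label sets for a single instance."""
    union = y_true | y_pred
    if not union:
        return 1.0  # assumption: two empty label sets count as full agreement
    return len(y_true & y_pred) / len(union)


def sample_average(pairs, metric=jaccard) -> float:
    """Mean of the per-instance metric over all (y_true, y_pred) pairs."""
    scores = [metric(y_true, y_pred) for y_true, y_pred in pairs]
    return sum(scores) / len(scores)


# Hypothetical instances as (true label set, predicted label set).
pairs = [
    ({"a", "b"}, {"a"}),  # partial overlap: 1 / 2
    ({"c"}, {"c"}),       # perfect: 1.0
    ({"a"}, {"b"}),       # disjoint: 0.0
]

sample_jaccard = sample_average(pairs)
print(sample_jaccard)  # 0.5
```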
Exact Match
ExactMatch = (1/N) * sum_{i=1}^{N} I(y_i == hat{y}_i)
where I() is the indicator function and y_i, hat{y}_i are full label sets
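The exact match formula reduces to counting instances whose label sets are identical. A minimal sketch, with label sets represented as Python sets and a hypothetical helper name:

```python
def exact_match(pairs) -> float:
    """Fraction of instances whose predicted label set equals the true label set."""
    hits = sum(1 for y_true, y_pred in pairs if y_true == y_pred)
    return hits / len(pairs)


# Hypothetical instances as (true label set, predicted label set).
pairs = [
    ({"news", "sports"}, {"news", "sports"}),  # exact match
    ({"news", "sports"}, {"news"}),            # one label missing: no credit
    (set(), set()),                            # both empty: exact match
    ({"tech"}, {"tech", "news"}),              # one extra label: no credit
]

score = exact_match(pairs)
print(score)  # 0.5
```

The second and fourth instances each get one label right but still contribute nothing, which is why exact match is usually the lowest of the multi-label scores.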
Rand Index
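For all pairs of instances (i, j) with i < j:
a = number of pairs whose two instances share the same assignment in the truth AND in the prediction
b = number of pairs whose two instances have different assignments in the truth AND in the prediction
RandIndex = (a + b) / C(N, 2)
where C(N, 2) = N(N-1)/2 is the total number of instance pairs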
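The pair-counting definition can be sketched directly. This version is a batch illustration that enumerates all pairs, so it costs O(N^2) and is not an incremental implementation; the function name and the example assignments are hypothetical. Each assignment only needs to be comparable with `==` (for a multi-label setting it could be a `frozenset` of labels).

```python
from itertools import combinations
from math import comb


def rand_index(truth, pred) -> float:
    """Rand index from pairwise agreements between two assignments."""
    n = len(truth)
    if n != len(pred):
        raise ValueError("truth and pred must have the same length")
    if n < 2:
        raise ValueError("at least two instances are needed to form a pair")

    a = b = 0
    for i, j in combinations(range(n), 2):
        same_true = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_true and same_pred:
            a += 1  # pair kept together in both truth and prediction
        elif not same_true and not same_pred:
            b += 1  # pair kept apart in both truth and prediction
    return (a + b) / comb(n, 2)


truth = ["x", "x", "y", "y"]
pred = ["p", "p", "q", "r"]

ri = rand_index(truth, pred)
print(ri)  # 5 of the 6 pairs agree
```

Only the pair formed by the last two instances disagrees: the truth puts them together while the prediction separates them.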
Streaming Updates
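In the online setting, each strategy maintains its underlying statistics incrementally instead of storing past instances. For the metrics built on confusion-matrix counts, the per-label update is: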
For each multi-label instance (x, y_true, y_pred):
    For each label l:
        if l in y_true and l in y_pred: TP_l += 1
        elif l in y_pred: FP_l += 1
        elif l in y_true: FN_l += 1
        else: TN_l += 1
    Recompute desired aggregated metric from updated counts
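The update loop above could be written in Python roughly as follows. This is a sketch, not River's implementation: the class name `StreamingLabelCounts` is hypothetical, the label set is assumed to be known up front, and F1 is used as the example metric for both aggregations.

```python
class StreamingLabelCounts:
    """Per-label TP/FP/FN/TN counts, updated one instance at a time."""

    def __init__(self, labels):
        # Assumption: all labels are known before the stream starts.
        self.counts = {l: {"TP": 0, "FP": 0, "FN": 0, "TN": 0} for l in labels}

    def update(self, y_true: set, y_pred: set) -> None:
        for label, c in self.counts.items():
            if label in y_true and label in y_pred:
                c["TP"] += 1
            elif label in y_pred:
                c["FP"] += 1
            elif label in y_true:
                c["FN"] += 1
            else:
                c["TN"] += 1

    @staticmethod
    def _f1(tp: int, fp: int, fn: int) -> float:
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    def macro_f1(self) -> float:
        """Mean of the per-label F1 scores."""
        scores = [self._f1(c["TP"], c["FP"], c["FN"]) for c in self.counts.values()]
        return sum(scores) / len(scores)

    def micro_f1(self) -> float:
        """F1 computed on counts pooled over all labels."""
        tp = sum(c["TP"] for c in self.counts.values())
        fp = sum(c["FP"] for c in self.counts.values())
        fn = sum(c["FN"] for c in self.counts.values())
        return self._f1(tp, fp, fn)


tracker = StreamingLabelCounts(labels=["a", "b"])
stream = [({"a"}, {"a"}), ({"a", "b"}, {"a"}), ({"b"}, {"a", "b"})]
for y_true, y_pred in stream:
    tracker.update(y_true, y_pred)  # constant memory per label
    print(tracker.macro_f1(), tracker.micro_f1())
```

Only the four counters per label are stored, so any count-based aggregate can be recomputed after each instance without revisiting earlier data.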
Related Pages
- Implementation:Online_ml_River_Metrics_ExactMatch
- Implementation:Online_ml_River_Metrics_MacroAverage
- Implementation:Online_ml_River_Metrics_MicroAverage
- Implementation:Online_ml_River_Metrics_MultiLabelConfusionMatrix
- Implementation:Online_ml_River_Metrics_PerOutput
- Implementation:Online_ml_River_Metrics_SampleAverage
- Implementation:Online_ml_River_Metrics_RandIndex