Principle:Cleanlab Cleanlab Segmentation Label Quality Scoring

Knowledge Sources	Cleanlab
Domains	Data Quality, Machine Learning, Computer Vision, Semantic Segmentation
Last Updated	2026-02-09 00:00 GMT

Overview

Segmentation label quality scoring produces continuous per-pixel and per-image quality scores for semantic segmentation datasets. Pixel scores come from the model's predicted confidence in each given label, and they are aggregated to the image level with softmin by default.

Description

In semantic segmentation, each image contains a dense grid of pixel-level labels, and label errors can range from a single misclassified pixel to large incorrectly annotated regions. Label quality scoring for segmentation addresses two complementary needs: (1) identifying which specific pixels are most likely mislabeled, and (2) ranking entire images by their overall annotation quality so that human reviewers can prioritize the most problematic images.

The fundamental signal for pixel-level scoring is the model's predicted probability for the given label class at each pixel. If a well-trained model assigns low probability to a pixel's annotated class, that pixel is more likely to be mislabeled. This per-pixel confidence can then be aggregated to the image level using different strategies.

The key design challenge is how to aggregate pixel scores into a single image score. A simple mean would dilute the signal from a few badly mislabeled pixels among thousands of correct ones. Instead, the softmin aggregation provides a principled compromise: it weights the contribution of each pixel's score by a function of how bad that score is, emphasizing the worst pixels while still accounting for the overall distribution. The temperature parameter controls this emphasis, creating a tunable spectrum between "worst pixel determines the image score" (low temperature) and "average pixel determines the image score" (high temperature).

Usage

Segmentation label quality scoring is the right approach when:

You need continuous, rankable scores rather than binary issue labels.
You want to prioritize images for human review based on overall annotation quality.
You need both pixel-level granularity (which specific pixels are suspect) and image-level summarization (which images need attention).
You want efficient scoring without running full Confident Learning (use the softmin method).

For binary issue detection (a yes/no decision on which pixels are likely mislabeled), use segmentation label issue filtering instead.
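For the review-prioritization use case listed above, a minimal sketch of turning image-level scores into a review queue could look like the following. The score values and the cutoff `k` are made up for illustration; only the convention that lower scores indicate worse annotation quality comes from this page.

```python
import numpy as np

# Hypothetical image-level quality scores (lower = more likely to contain label errors).
image_scores = np.array([0.91, 0.12, 0.67, 0.35])

# Image indices sorted from lowest to highest quality score.
review_order = np.argsort(image_scores)

# Keep only the k lowest-scoring images for a first review pass (k is an arbitrary choice).
k = 2
to_review = review_order[:k]
print(to_review)  # [1 3]
```

Because the scores are continuous, the cutoff can be adjusted to match the available review budget.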

Theoretical Basis

Per-Pixel Quality Score

The per-pixel label quality score is simply the model's predicted probability for the given class:

pixel_score(i, h, w) = P(class = labels[i, h, w] | x[i, h, w])

where P is the model's softmax output for pixel (h, w) in image i. This score is 1.0 if the model is perfectly confident in the given label, and approaches 0 as the model becomes confident in a different class. Out-of-sample predictions (e.g., from cross-validation) produce more reliable scores than in-sample predictions.
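As an illustration of the per-pixel score, the sketch below looks up the predicted probability of the annotated class at every pixel with NumPy. The function name `pixel_scores_from_pred_probs` is hypothetical, and the array layout (`labels` as `(N, H, W)`, `pred_probs` as `(N, K, H, W)`) is an assumption made for this example.

```python
import numpy as np

def pixel_scores_from_pred_probs(labels: np.ndarray, pred_probs: np.ndarray) -> np.ndarray:
    """Return the predicted probability of the given label at every pixel.

    labels:     (N, H, W) integer class indices (assumed layout)
    pred_probs: (N, K, H, W) per-class probabilities (assumed layout)
    """
    # Add a class axis so the labels can index the K dimension of pred_probs.
    idx = labels[:, np.newaxis, :, :]
    # At each pixel, pick the probability assigned to the annotated class.
    return np.take_along_axis(pred_probs, idx, axis=1)[:, 0, :, :]

# Toy data: 1 image, 2 classes, 2x2 pixels.
pred_probs = np.array([[[[0.9, 0.2], [0.6, 0.5]],    # class 0 probabilities
                        [[0.1, 0.8], [0.4, 0.5]]]])  # class 1 probabilities
labels = np.array([[[0, 0], [1, 1]]])

scores = pixel_scores_from_pred_probs(labels, pred_probs)
print(scores)  # the pixel labeled 0 where the model predicts P(class 0) = 0.2 scores lowest
```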

Softmin Aggregation

To aggregate pixel-level scores into an image-level score, the softmin function computes a weighted average that emphasizes low-scoring pixels:

image_score = sum(pixel_scores * softmax((1 - pixel_scores) / temperature))

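The softmax is taken over all pixels of the image, so the weights are positive and sum to 1, which makes the image score a weighted average of the pixel scores. Applying it to (1 - pixel_scores) / temperature assigns higher weights to pixels with lower quality scores (since 1 - score is larger for worse pixels). The behavior depends on the temperature parameter: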

Low temperature (approaching 0): The softmax sharpens, placing nearly all weight on the worst pixel. The image score converges to the minimum pixel score.
High temperature: The softmax flattens, distributing weight more evenly. The image score converges to the mean pixel score.
Default temperature (0.1): Provides a balance that strongly emphasizes the worst pixels while remaining smooth and differentiable.

Softmin aggregation is more robust than a simple minimum (which is noisy and sensitive to a single outlier pixel) and more sensitive than a simple mean (which dilutes the signal from mislabeled regions).
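The temperature behavior described above can be checked numerically. The sketch below is a direct NumPy transcription of the softmin formula in this section, not the library's own code; the function name `softmin_image_score` and the toy pixel scores are assumptions for illustration.

```python
import numpy as np

def softmin_image_score(pixel_scores: np.ndarray, temperature: float = 0.1) -> float:
    """Weighted average of pixel scores, with softmax weights on (1 - score) / temperature."""
    s = np.asarray(pixel_scores, dtype=float).ravel()
    logits = (1.0 - s) / temperature
    logits -= logits.max()  # shift for numerical stability; the softmax result is unchanged
    weights = np.exp(logits)
    weights /= weights.sum()
    return float(np.sum(s * weights))

# Toy image: 99 confidently correct pixels and one suspicious pixel.
pixel_scores = np.array([0.95] * 99 + [0.05])

mean_score = float(pixel_scores.mean())
min_score = float(pixel_scores.min())
default_score = softmin_image_score(pixel_scores)       # temperature = 0.1
cold_score = softmin_image_score(pixel_scores, 0.001)   # very low temperature
hot_score = softmin_image_score(pixel_scores, 1000.0)   # very high temperature

print(f"mean={mean_score:.3f} min={min_score:.3f}")
print(f"softmin: T=0.001 -> {cold_score:.3f}, T=0.1 -> {default_score:.3f}, T=1000 -> {hot_score:.3f}")
```

In this toy case the low-temperature score matches the minimum and the high-temperature score matches the mean, while the default temperature lands between the two and sits much closer to the worst pixel.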

Alternative: Issue-Count Method

The "num_pixel_issues" method takes a different approach by first running Confident Learning to identify binary label issues, then computing:

image_score = 1 - (number of issue pixels) / (total pixels)

This is more computationally expensive (requiring two passes through the data) but directly reflects the estimated number of label errors rather than model confidence. It is appropriate when you want scores that are interpretable as the estimated fraction of correctly labeled pixels.
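The issue-count formula can be sketched as follows, assuming a boolean mask of flagged pixels is already available from a separate issue-detection step. The function name `issue_count_image_scores` and the `(N, H, W)` mask layout are hypothetical; producing the mask itself is outside the scope of this sketch.

```python
import numpy as np

def issue_count_image_scores(issue_mask: np.ndarray) -> np.ndarray:
    """Compute 1 - (issue pixels / total pixels) for each image.

    issue_mask: (N, H, W) boolean array, True where a pixel was flagged (assumed layout)
    """
    n_images = issue_mask.shape[0]
    flat = issue_mask.reshape(n_images, -1)
    # The mean of a boolean array is the fraction of True entries.
    return 1.0 - flat.mean(axis=1)

# Toy masks: image 0 has no flagged pixels, image 1 has 4 of 16 flagged.
issue_mask = np.zeros((2, 4, 4), dtype=bool)
issue_mask[1, :2, :2] = True

image_scores = issue_count_image_scores(issue_mask)
print(image_scores)  # [1.   0.75]
```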

Related Pages

Implementation:Cleanlab_Cleanlab_Segmentation_Get_Label_Quality_Scores
