
Principle: Cleanlab Multiannotator Consensus Estimation

From Leeroopedia


Knowledge Sources CROWDLAB, Cleanlab
Domains Machine_Learning, Data_Quality, Crowdsourcing
Last Updated 2026-02-09

Overview

CROWDLAB is an algorithm for estimating consensus labels and quality scores from noisy crowd-sourced annotations by weighting annotators according to their reliability.

Description

CROWDLAB combines multiple annotator labels with model predictions to estimate the true label for each example. It addresses the fundamental challenge of crowdsourcing: individual annotators are noisy and may have varying levels of expertise, so simple majority voting is often suboptimal.

The algorithm operates by:

  • Modeling annotator quality: Each annotator receives a weight that reflects how often they agree with the estimated consensus. More reliable annotators receive higher weights.
  • Incorporating model predictions: A trained classifier's predictions are treated as an additional "annotator" with its own weight, calibrated against the annotator ensemble.
  • Iterative refinement: Consensus labels and annotator weights are iteratively refined to produce robust estimates.

The output includes:

  • Consensus labels: The best estimate of the true label for each example.
  • Per-example quality scores: How confident we are in each consensus label.
  • Annotator statistics: Reliability scores for each annotator, enabling identification of low-quality annotators.
  • Optionally: The learned weights for the model and each annotator.

Two consensus methods are supported:

  • best_quality: Uses the CROWDLAB algorithm to weight annotators and model predictions for optimal consensus quality.
  • majority_vote: Uses simple majority voting among annotators as a baseline, without incorporating model predictions.
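As a point of reference, the majority_vote baseline can be sketched in plain NumPy. The array layout and the -1 missing-label convention here are illustrative choices, not Cleanlab's actual API:

```python
import numpy as np

# Toy data: rows are examples, columns are annotators;
# -1 marks an example that annotator did not label.
labels = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [2, -1, 2],
    [0, 1, -1],
])

def majority_vote(labels, num_classes):
    """Return the most frequent label per example, ignoring missing (-1) entries."""
    consensus = np.empty(labels.shape[0], dtype=int)
    for i, row in enumerate(labels):
        votes = np.bincount(row[row >= 0], minlength=num_classes)
        consensus[i] = votes.argmax()  # ties break toward the lowest class index
    return consensus

print(majority_vote(labels, num_classes=3))  # → [0 1 2 0]
```

Note that ties (as in the last example, where annotators split 1-1) are resolved arbitrarily here; this is precisely the kind of ambiguity the weighted best_quality method is designed to resolve.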

Usage

Multiannotator consensus estimation is used in crowdsourcing workflows where multiple annotators label each example. Typical applications include:

  • Label aggregation: Combining noisy crowd labels into a single high-quality consensus label per example.
  • Annotator evaluation: Identifying unreliable annotators whose labels should be weighted less or who need additional training.
  • Quality control: Flagging examples with low consensus quality for additional review or re-annotation.
  • Training data preparation: Producing clean training labels from crowdsourced annotations for downstream model training.

Theoretical Basis

The CROWDLAB ensemble method combines annotator votes and model predictions using learned weights.

Step 1: Initial Consensus. Compute an initial consensus estimate using majority voting across annotators.

Step 2: Annotator Weight Estimation. For each annotator, compute a quality weight based on agreement with the current consensus:

annotator_weight[a] = (number of examples where annotator a agrees with consensus)
                      / (number of examples annotated by annotator a)
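This agreement-rate weight can be computed vectorized over all annotators. The data layout below is a toy illustration (labels uses -1 for "not annotated"), not Cleanlab's internal representation:

```python
import numpy as np

labels = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [2, -1, 2],
    [0, 1, -1],
])
consensus = np.array([0, 1, 2, 0])  # current consensus estimate, one label per example

def annotator_weights(labels, consensus):
    """Fraction of each annotator's labels that match the current consensus."""
    annotated = labels >= 0                            # which examples each annotator labeled
    agrees = (labels == consensus[:, None]) & annotated
    return agrees.sum(axis=0) / annotated.sum(axis=0)

print(annotator_weights(labels, consensus))  # → [1.0, 0.667, 0.667]
```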

Step 3: Model Weight Calibration. Calibrate the model's weight relative to the annotator ensemble. The model is treated as an additional annotator whose reliability is estimated from its agreement with the consensus. Optionally, temperature scaling is applied to calibrate the model's predicted probabilities.
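Temperature scaling itself is a standard calibration technique: divide the model's logits by a temperature T before the softmax. A generic sketch (not Cleanlab-specific):

```python
import numpy as np

def temperature_scale(logits, T):
    """Softmax with temperature T; T > 1 flattens the distribution, T < 1 sharpens it."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 0.5, 0.0]])
print(temperature_scale(logits, T=1.0))  # sharper, more confident
print(temperature_scale(logits, T=2.0))  # flatter, less confident
```

In practice T would be fit on held-out data so the model's confidence matches its empirical accuracy.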

Step 4: Weighted Consensus. For each example, compute the consensus as a weighted combination of annotator labels and model prediction:

For each example x:
    score(class_k) = model_weight * pred_probs[x][k]
                   + sum over annotators a who labeled x:
                       annotator_weight[a] * (1 if annotator_a_label == k else 0)
    consensus_label[x] = argmax_k(score(class_k))
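The weighted combination above can be made concrete with a small NumPy sketch. The annotator weights and model weight below are assumed toy values, and the layout is illustrative rather than Cleanlab's actual implementation:

```python
import numpy as np

labels = np.array([       # per-example annotator labels, -1 = not annotated
    [0, 0, 1],
    [1, 1, 1],
    [2, -1, 2],
    [0, 1, -1],
])
pred_probs = np.array([   # model's predicted class probabilities per example
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.3, 0.6, 0.1],
])
annotator_w = np.array([1.0, 2 / 3, 2 / 3])  # assumed annotator weights
model_w = 0.8                                # assumed calibrated model weight

def weighted_consensus(labels, pred_probs, annotator_w, model_w):
    """Score each class by model probability plus the weights of annotators voting for it."""
    scores = model_w * pred_probs.copy()
    for a in range(labels.shape[1]):
        mask = labels[:, a] >= 0
        scores[mask, labels[mask, a]] += annotator_w[a]
    return scores.argmax(axis=1)

print(weighted_consensus(labels, pred_probs, annotator_w, model_w))  # → [0 1 2 0]
```

Note how the last example's 1-1 annotator tie is broken by the more reliable annotator plus the model's prediction, rather than arbitrarily.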

Step 5: Quality Scoring. Per-example quality scores are computed based on agreement among annotators and model confidence:

quality_score[x] = weighted_agreement(annotator_labels[x], consensus_label[x], annotator_weights)

Higher quality scores indicate examples where annotators and the model are in strong agreement, while lower scores indicate contentious or ambiguous examples.
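One simple instantiation of such a weighted-agreement score is the fraction of annotator weight that backs the consensus label (an illustrative sketch, not Cleanlab's exact scoring formula):

```python
import numpy as np

labels = np.array([       # per-example annotator labels, -1 = not annotated
    [0, 0, 1],
    [1, 1, 1],
    [2, -1, 2],
    [0, 1, -1],
])
consensus = np.array([0, 1, 2, 0])
annotator_w = np.array([1.0, 2 / 3, 2 / 3])  # assumed annotator weights

def quality_scores(labels, consensus, annotator_w):
    """Weighted fraction of annotator votes agreeing with the consensus label."""
    annotated = labels >= 0
    agrees = (labels == consensus[:, None]) & annotated
    total_w = (annotated * annotator_w).sum(axis=1)
    return (agrees * annotator_w).sum(axis=1) / total_w

print(quality_scores(labels, consensus, annotator_w))
```

Here the unanimously labeled second example scores 1.0, while the first example (one dissenting annotator) scores lower, matching the intuition that contentious examples deserve review.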
