
Workflow:Cleanlab Multiannotator Consensus

From Leeroopedia


Knowledge Sources
Domains Data_Centric_AI, Crowdsourcing, Annotation_Quality, Active_Learning
Last Updated 2026-02-09 19:00 GMT

Overview

End-to-end process for analyzing classification data labeled by multiple annotators to estimate consensus labels, label quality, and annotator reliability using cleanlab's CROWDLAB algorithm.

Description

This workflow uses cleanlab's multiannotator module to analyze datasets where multiple human annotators have provided labels for some or all examples. It combines individual annotations with model predictions to estimate: a consensus label for each example that is more accurate than majority vote, a quality score for each consensus label measuring confidence in its correctness, quality scores for each individual annotation, and an overall reliability score for each annotator. The CROWDLAB algorithm leverages a trained classifier's predictions as an additional "virtual annotator" to break ties and improve consensus estimates, particularly when annotators disagree.

Usage

Execute this workflow when you have classification data labeled by multiple annotators (crowdsourcing scenario) and want to determine the best consensus label for each example, identify which individual annotations are most likely incorrect, and assess annotator reliability. This is also the starting point for active learning with re-labeling, where you want to decide which examples to collect additional labels for using the ActiveLab scoring method.

Execution Steps

Step 1: Prepare Multi-Annotator Labels and Model Predictions

Format your multi-annotator labels as a DataFrame or 2D array where rows are examples and columns are annotators, with NaN/missing values for examples an annotator did not label. Separately obtain out-of-sample predicted probabilities from a trained classifier. Optionally, for ensemble methods, obtain pred_probs from multiple models.

Key considerations:

  • Labels format: DataFrame with shape (N, A) where N is examples and A is annotators
  • Missing entries (annotator did not label this example) should be NaN
  • pred_probs should be out-of-sample with shape (N, K) for K classes
  • All annotators must use the same class indices (0, 1, ..., K-1)
  • For ensemble methods, pred_probs can be a 3D array of shape (P, N, K) for P models
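The preparation above can be sketched as follows. This is a minimal example with synthetic data: the random features, the three hypothetical annotator columns, and the crude majority-vote training labels are all stand-ins for your real dataset, and the out-of-sample pred_probs are obtained via scikit-learn's cross_val_predict so no example is scored by a model that saw its own label.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
N, A, K = 100, 3, 2  # examples, annotators, classes

# Multi-annotator labels: rows = examples, columns = annotators.
# NaN marks examples an annotator did not label.
labels = pd.DataFrame(
    rng.integers(0, K, size=(N, A)).astype(float),
    columns=[f"annotator_{j}" for j in range(A)],
)
labels.iloc[::4, 0] = np.nan  # annotator_0 skipped every 4th example

# Out-of-sample predicted probabilities from a classifier, via
# cross-validation; training labels here are a crude majority vote.
X = rng.normal(size=(N, 5))  # synthetic features for illustration
y = labels.mode(axis=1)[0].astype(int)
pred_probs = cross_val_predict(
    LogisticRegression(), X, y, cv=5, method="predict_proba"
)

assert labels.shape == (N, A)       # (examples, annotators)
assert pred_probs.shape == (N, K)   # (examples, classes)
```

Each row of pred_probs sums to 1, and the class indices 0..K-1 match the values used in the labels DataFrame, as required.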

Step 2: Compute Consensus and Quality Scores

Call get_label_quality_multiannotator with the multi-annotator labels and predicted probabilities. This runs the CROWDLAB algorithm to estimate consensus labels, per-example consensus quality scores, per-annotation quality scores, and overall annotator quality scores. The algorithm treats the classifier as an additional annotator and computes a weighted combination of all inputs.

Key considerations:

  • Consensus methods available: "best_quality" (recommended, uses CROWDLAB) or "majority_vote"
  • The quality_method parameter controls scoring; "crowdlab" is the default
  • Setting return_detailed_quality=True provides per-annotation scores
  • Setting return_annotator_stats=True provides per-annotator reliability metrics
  • Temperature scaling can optionally calibrate the model's predicted probabilities

Step 3: Assess Annotator Reliability

Extract and analyze the per-annotator quality scores from the returned results. These scores estimate each annotator's overall label accuracy, enabling identification of unreliable annotators whose labels should be weighted less or excluded from future labeling tasks.

Key considerations:

  • Annotator quality scores range from 0 to 1; lower means less reliable
  • Scores account for the difficulty of examples each annotator labeled
  • Unreliable annotators may need retraining or removal from the annotation pool
  • Agreement matrices between annotator pairs can reveal systematic biases
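One way to act on the annotator statistics is sketched below. The DataFrame here uses the column names cleanlab documents for annotator_stats (annotator_quality, agreement_with_consensus, worst_class, num_examples_labeled), but the values are made up for illustration, and the 0.6 cutoff is an arbitrary choice, not a cleanlab default.

```python
import pandas as pd

# Illustrative annotator_stats, shaped like the output of
# get_label_quality_multiannotator(..., return_annotator_stats=True).
annotator_stats = pd.DataFrame(
    {
        "annotator_quality": [0.91, 0.84, 0.52],
        "agreement_with_consensus": [0.93, 0.86, 0.55],
        "worst_class": [1, 0, 1],
        "num_examples_labeled": [80, 95, 60],
    },
    index=["a0", "a1", "a2"],
)

# Flag annotators whose estimated quality falls below a chosen threshold.
THRESHOLD = 0.6  # arbitrary cutoff, tune for your annotation pool
unreliable = annotator_stats[annotator_stats["annotator_quality"] < THRESHOLD]
print(unreliable.index.tolist())  # → ['a2']
```

Flagged annotators are candidates for retraining, spot-checking, or removal; their existing labels are already down-weighted by CROWDLAB's consensus estimate.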

Step 4: Prioritize Data for Re_Labeling (Active Learning)

Optionally, call get_active_learning_scores to compute ActiveLab scores that prioritize which examples would benefit most from additional annotations. This is useful when your annotation budget is limited and you want to maximize the impact of each new label collected.

Key considerations:

  • ActiveLab scores indicate how informative an additional label would be for each example
  • Lower scores indicate examples where additional labels would be most valuable
  • Works in batch mode (label many examples at once) or sequential mode (one at a time)
  • Supports settings where some examples have no labels yet (cold-start active learning)

Step 5: Apply Consensus Labels

Replace the original noisy labels with the estimated consensus labels for model training. The consensus labels, combined with quality scores, provide a cleaner training dataset. Examples with very low consensus quality scores can be flagged for additional review or excluded entirely.

Key considerations:

  • Consensus labels outperform majority vote, especially when annotators disagree
  • Quality scores can be used as sample weights during training
  • Low-quality consensus examples may benefit from additional annotations rather than exclusion
  • The cleaned dataset can be used with any downstream ML pipeline
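The final step can be sketched as below. The consensus labels and quality scores are synthesized here (normally they come from results["label_quality"] in Step 2), the 0.5 review cutoff is an arbitrary choice rather than a cleanlab default, and quality-weighted training uses scikit-learn's standard sample_weight argument.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
N = 100

# Stand-ins for Step 2 outputs: features, consensus labels, quality scores.
X = rng.normal(size=(N, 4))
consensus_labels = (X[:, 0] > 0).astype(int)       # synthetic, linearly separable
consensus_quality = rng.uniform(0.3, 1.0, size=N)  # synthetic quality scores

# Flag very uncertain examples for review instead of silently training on them.
REVIEW_CUTOFF = 0.5  # arbitrary cutoff, not a cleanlab default
needs_review = consensus_quality < REVIEW_CUTOFF

# Weight each example by the confidence in its consensus label.
clf = LogisticRegression()
clf.fit(X, consensus_labels, sample_weight=consensus_quality)
train_accuracy = clf.score(X, consensus_labels)
```

Low-quality examples could instead be routed back to Step 4 for additional annotations before the final training run.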

Execution Diagram

GitHub URL

Workflow Repository