Principle:Snorkel team Snorkel Label Quality Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Weak_Supervision, Metrics |
| Last Updated | 2026-02-14 20:00 GMT |
Overview
A methodology for evaluating the quality of programmatically generated labels by computing classification metrics against a gold-labeled development set.
Description
Label Quality Evaluation measures how well the labels produced by the label model (or by baseline models) match ground-truth labels on a held-out development set. It is the final validation step in the weak supervision pipeline before the generated labels are used for downstream model training.
Evaluation compares predictions from:
- Label Model: a learned combination of labeling function (LF) votes
- Majority Label Voter: a simple majority-vote baseline over the LF votes
- Random Voter: a uniform random baseline
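To make the majority-vote baseline concrete, the sketch below computes it over a small label matrix in plain Python. It is an illustration rather than the library's implementation: the function name `majority_vote`, the example matrix `L_dev`, the use of `-1` as the abstain marker, and the choice to abstain on ties are all assumptions made for this example.

```python
from collections import Counter

ABSTAIN = -1  # assumed abstain marker for this sketch


def majority_vote(lf_votes, abstain=ABSTAIN):
    """Return the most common non-abstain vote for one example.

    Abstains when every LF abstained or when the top two labels are tied
    (the tie policy is an assumption of this sketch).
    """
    counts = Counter(v for v in lf_votes if v != abstain)
    if not counts:
        return abstain
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return abstain
    return ranked[0][0]


# Hypothetical label matrix: one row per example, one column per LF.
L_dev = [
    [1, 1, 0],     # two votes for 1, one for 0
    [0, -1, 0],    # one LF abstains
    [1, 0, -1],    # tie between 0 and 1
    [-1, -1, -1],  # every LF abstains
]

mv_preds = [majority_vote(row) for row in L_dev]
print(mv_preds)  # [1, 0, -1, -1]
```

A learned label model is only worth its extra complexity if it beats this baseline on the development set, which is why the two are scored side by side.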
Standard classification metrics (accuracy, F1, precision, recall, ROC-AUC) are computed, with support for handling abstentions. The Scorer class also supports custom metric functions and slice-level evaluation.
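The following sketch shows one way abstention handling and custom metric functions can fit together. It does not use the Scorer class itself: `drop_abstains`, `precision_recall_f1`, `score`, and `false_positive_count` are hypothetical helpers, and the sketch assumes binary labels, `-1` as the abstain marker, and that only abstaining predictions are filtered out before scoring.

```python
ABSTAIN = -1  # assumed abstain marker for this sketch


def drop_abstains(golds, preds, abstain=ABSTAIN):
    """Keep only the examples whose prediction is not an abstention."""
    kept = [(g, p) for g, p in zip(golds, preds) if p != abstain]
    return [g for g, _ in kept], [p for _, p in kept]


def precision_recall_f1(golds, preds, positive=1):
    """Binary precision, recall and F1 for the given positive class."""
    pairs = list(zip(golds, preds))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def score(golds, preds, custom_metric_funcs=None, abstain=ABSTAIN):
    """Filter abstentions, then compute standard and custom metrics."""
    golds_kept, preds_kept = drop_abstains(golds, preds, abstain)
    results = precision_recall_f1(golds_kept, preds_kept)
    for name, func in (custom_metric_funcs or {}).items():
        results[name] = func(golds_kept, preds_kept)
    return results


def false_positive_count(golds, preds):
    """Example custom metric: number of negatives predicted as positive."""
    return sum(1 for g, p in zip(golds, preds) if p == 1 and g == 0)


golds = [1, 0, 1, 1, 0, 1]
preds = [1, 1, -1, 1, 0, 0]  # the third prediction abstains
results = score(golds, preds, {"false_positives": false_positive_count})
print(results)
```

Slice-level evaluation follows the same pattern: restrict `golds` and `preds` to the indices belonging to a slice and call `score` on that subset.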
Usage
Use this principle after generating probabilistic or discrete labels from a trained label model. Evaluate label quality to validate that the weak supervision approach produces acceptable label accuracy before committing to downstream model training.
Theoretical Basis
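Given gold labels $y = (y_1, \dots, y_n)$ and predictions $\hat{y} = (\hat{y}_1, \dots, \hat{y}_n)$:
Accuracy: $\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[\hat{y}_i = y_i]$
Coverage (fraction of non-abstain predictions): $\text{Coverage} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[\hat{y}_i \neq -1]$, where $-1$ is the abstain label.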
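As a small worked example, accuracy and coverage can be computed in a few lines of plain Python. The helper names `accuracy` and `coverage`, the sample labels, and the `-1` abstain marker are assumptions of this sketch. Whether abstentions count as errors or are filtered out before scoring is an evaluation choice, so the sketch reports accuracy both ways.

```python
ABSTAIN = -1  # assumed abstain marker for this sketch


def coverage(preds, abstain=ABSTAIN):
    """Fraction of predictions that are not abstentions."""
    if not preds:
        return 0.0
    return sum(1 for p in preds if p != abstain) / len(preds)


def accuracy(golds, preds):
    """Fraction of examples where the prediction equals the gold label."""
    if len(golds) != len(preds):
        raise ValueError("golds and preds must have the same length")
    if not golds:
        return 0.0
    return sum(1 for g, p in zip(golds, preds) if g == p) / len(golds)


golds = [1, 0, 1, 1, 0]
preds = [1, 0, -1, 0, 0]  # the third prediction abstains

cov = coverage(preds)

# Accuracy over all examples: an abstention never matches a gold label.
acc_all = accuracy(golds, preds)

# Accuracy over covered examples only: abstentions are filtered out first.
covered = [(g, p) for g, p in zip(golds, preds) if p != ABSTAIN]
acc_covered = accuracy([g for g, _ in covered], [p for _, p in covered])

print(cov, acc_all, acc_covered)  # 0.8 0.6 0.75
```

Reporting coverage next to accuracy matters because a model that abstains on every hard example can show high accuracy on the few examples it does label.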
For probabilistic predictions, metrics such as ROC-AUC operate on the predicted class probabilities directly rather than on hard labels. The probs_to_preds utility converts probabilities to hard predictions, with a configurable policy for breaking ties between equally likely classes.
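The conversion from probabilities to hard predictions can be sketched as an argmax with an explicit tie rule. This is not the probs_to_preds implementation: the function name `probs_to_preds_sketch`, the policy names `"abstain"` and `"random"`, the tolerance `tol`, and the fixed `seed` are assumptions chosen for illustration.

```python
import random

ABSTAIN = -1  # assumed abstain marker for this sketch


def probs_to_preds_sketch(probs, tie_break_policy="abstain", tol=1e-5, seed=0):
    """Convert per-class probabilities to hard labels.

    probs: list of rows, each a list of class probabilities.
    tie_break_policy: "abstain" returns ABSTAIN on a tie; "random" picks one
        of the tied classes with a seeded generator (hypothetical policies).
    """
    rng = random.Random(seed)
    preds = []
    for row in probs:
        top = max(row)
        # Classes whose probability is within tol of the maximum are tied.
        tied = [k for k, p in enumerate(row) if abs(p - top) <= tol]
        if len(tied) == 1:
            preds.append(tied[0])
        elif tie_break_policy == "abstain":
            preds.append(ABSTAIN)
        elif tie_break_policy == "random":
            preds.append(rng.choice(tied))
        else:
            raise ValueError(f"unknown tie_break_policy: {tie_break_policy}")
    return preds


probs = [
    [0.9, 0.1],  # clear winner: class 0
    [0.5, 0.5],  # exact tie
    [0.2, 0.8],  # clear winner: class 1
]

hard_preds = probs_to_preds_sketch(probs)
print(hard_preds)  # [0, -1, 1]
```

The tie policy affects downstream metrics: abstaining on ties lowers coverage, while random tie-breaking keeps coverage at 1.0 but adds noise to accuracy.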