Principle:Snorkel team Snorkel Label Quality Evaluation

Knowledge Sources	Data Programming: Creating Large Training Sets Quickly; Snorkel Intro Tutorial
Domains	Evaluation, Weak_Supervision, Metrics
Last Updated	2026-02-14 20:00 GMT

Overview

A methodology for evaluating the quality of programmatically generated labels by computing classification metrics against a gold-labeled development set.

Description

Label Quality Evaluation measures how well the labels produced by the label model (or baseline models) match ground truth labels on a held-out development set. This is the final validation step in the weak supervision pipeline before using the generated labels for downstream model training.

Evaluation compares predictions from:

Label Model: Learned combination of LF votes
Majority Label Voter: Simple majority vote baseline
Random Voter: Uniform random baseline
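
To make the majority-vote baseline concrete, the following is a minimal sketch in plain Python. It is not Snorkel's own implementation: the function name `majority_vote`, the toy label matrix `L_dev`, and the choice to abstain on ties are assumptions made for illustration.

```python
from collections import Counter

ABSTAIN = -1  # assumed abstain sentinel, matching the -1 used for abstentions on this page


def majority_vote(lf_votes, abstain=ABSTAIN):
    """Return the most common non-abstain vote; abstain on ties or when no LF voted."""
    counts = Counter(v for v in lf_votes if v != abstain)
    if not counts:
        return abstain  # every labeling function abstained
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return abstain  # tie between the top classes (assumed policy for this sketch)
    return ranked[0][0]


# Hypothetical label matrix: one row per data point, one column per labeling function.
L_dev = [
    [1, 1, -1],    # two votes for class 1
    [0, 1, -1],    # tie between class 0 and class 1
    [-1, -1, -1],  # no votes at all
    [0, 0, 1],     # class 0 wins 2 to 1
]

baseline_preds = [majority_vote(row) for row in L_dev]
print(baseline_preds)
```

Comparing a learned label model against a baseline like this shows whether the learned combination of LF votes adds anything over simple counting.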

Standard classification metrics (accuracy, F1, precision, recall, ROC-AUC) are computed, with support for handling abstentions. The Scorer class also supports custom metric functions and slice-level evaluation.
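
As an illustration of how abstentions interact with these metrics, here is a small sketch of binary precision, recall, and F1 that drops abstained predictions before scoring. The helper name `binary_prf` and the filtering policy are assumptions for this example, not a description of the Scorer class internals.

```python
def binary_prf(golds, preds, positive=1, abstain=-1):
    """Compute precision, recall, and F1 for the positive class.

    Abstained predictions are dropped before scoring (an assumed policy;
    counting them as errors instead is an equally valid choice).
    """
    pairs = [(g, p) for g, p in zip(golds, preds) if p != abstain]
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold labels and predictions; -1 marks an abstention.
golds = [1, 0, 1, 1, 0, 1]
preds = [1, 1, -1, 0, 0, 1]

scores = binary_prf(golds, preds)
print(scores)
```

Because the abstained point is excluded, these scores describe quality only on the covered portion of the development set, so they are best read alongside coverage.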

Usage

Use this principle after generating probabilistic or discrete labels from a trained label model. Evaluate label quality to validate that the weak supervision approach produces acceptable label accuracy before committing to downstream model training.

Theoretical Basis

Given gold labels $Y$ and predictions $\hat{Y}$ over $n$ development examples:

Accuracy: $\text{accuracy} = \frac{|\{ i : \hat{Y}_{i} = Y_{i} \}|}{n}$

Coverage (fraction of non-abstain predictions): $\text{coverage} = \frac{|\{ i : \hat{Y}_{i} \neq -1 \}|}{n}$
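
The two formulas above translate directly into code. The sketch below follows them literally, so an abstained prediction (-1) never matches a gold label and counts against accuracy. The function names and toy data are illustrative assumptions.

```python
def accuracy(golds, preds):
    """Fraction of all n points whose prediction equals the gold label."""
    if not golds or len(golds) != len(preds):
        raise ValueError("golds and preds must be non-empty and the same length")
    return sum(1 for g, p in zip(golds, preds) if g == p) / len(golds)


def coverage(preds, abstain=-1):
    """Fraction of all n points that received a non-abstain prediction."""
    if not preds:
        raise ValueError("preds must be non-empty")
    return sum(1 for p in preds if p != abstain) / len(preds)


# Hypothetical development set of n = 5 points; -1 marks an abstention.
golds = [1, 0, 1, 1, 0]
preds = [1, 0, -1, 0, 0]

acc = accuracy(golds, preds)  # 3 matching points out of 5
cov = coverage(preds)         # 4 non-abstain predictions out of 5
print(acc, cov)
```

Reporting the two numbers together helps separate "the labels are wrong" from "the labels are missing".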

For probabilistic predictions, metrics like ROC-AUC operate on the predicted class probabilities directly, without first converting them to hard labels. When hard labels are needed, the probs_to_preds utility converts probabilities to predictions with configurable tie-breaking.
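
The conversion from probabilities to hard predictions can be sketched as follows. This is a standalone illustration rather than the library's probs_to_preds: the function name `probs_to_hard_preds`, the policy names "abstain" and "first", and the tolerance value are all assumptions chosen for this example.

```python
def probs_to_hard_preds(probs, tie_break="abstain", abstain=-1, tol=1e-5):
    """Pick the highest-probability class per row, with explicit tie handling.

    tie_break="abstain" returns the abstain label on ties;
    tie_break="first" returns the lowest class index among the tied classes.
    Both policy names are hypothetical and used only in this sketch.
    """
    preds = []
    for row in probs:
        top = max(row)
        # Classes whose probability is within tol of the maximum count as tied.
        winners = [k for k, p in enumerate(row) if abs(p - top) <= tol]
        if len(winners) == 1:
            preds.append(winners[0])
        elif tie_break == "abstain":
            preds.append(abstain)
        elif tie_break == "first":
            preds.append(winners[0])
        else:
            raise ValueError(f"unknown tie_break policy: {tie_break!r}")
    return preds


# Hypothetical probabilistic labels for three points over two classes.
probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]

preds_abstain = probs_to_hard_preds(probs, tie_break="abstain")
preds_first = probs_to_hard_preds(probs, tie_break="first")
print(preds_abstain, preds_first)
```

The tie-breaking choice matters for evaluation: abstaining on ties lowers coverage, while a deterministic pick keeps coverage at 1 but can bias predictions toward lower class indices.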

Related Pages

Implemented By

Implementation:Snorkel_team_Snorkel_Scorer_Score
