Principle: Snorkel Probabilistic Label Generation
| Knowledge Sources | |
|---|---|
| Domains | Weak_Supervision, Probabilistic_Inference |
| Last Updated | 2026-02-14 20:00 GMT |
Overview
A method for generating probabilistic (soft) or discrete (hard) labels from a trained label model by marginalizing over the learned LF accuracy parameters.
Description
Probabilistic Label Generation is the inference step of the data programming pipeline. After training the label model to learn LF accuracies, this step uses those learned parameters to produce labels for each data point. The output can be:
- Probabilistic labels: A probability distribution over classes for each data point, capturing uncertainty in the labeling
- Discrete labels: Hard label assignments obtained by taking the argmax of the probabilities, with configurable tie-breaking policies
Probabilistic labels are particularly valuable because they preserve uncertainty information that can be propagated to downstream model training via noise-aware loss functions (e.g., cross-entropy with soft targets).
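As a sketch of how soft labels feed a noise-aware loss, the cross-entropy with soft targets mentioned above can be written in a few lines of NumPy (the function name and the example probabilities are illustrative, not from any particular library):

```python
import numpy as np

def soft_cross_entropy(probs, soft_targets, eps=1e-12):
    """Noise-aware cross-entropy: targets are probabilistic labels,
    not one-hot vectors, so label uncertainty weights the loss."""
    return -np.mean(np.sum(soft_targets * np.log(probs + eps), axis=1))

# Hypothetical probabilistic labels from a trained label model
soft_targets = np.array([[0.9, 0.1], [0.3, 0.7]])
# Hypothetical downstream model's predicted class probabilities
probs = np.array([[0.8, 0.2], [0.4, 0.6]])
loss = soft_cross_entropy(probs, soft_targets)
```

A one-hot target is the special case where each row of `soft_targets` puts probability 1 on a single class, recovering the standard cross-entropy.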
Usage
Use this principle after training a label model. Generate probabilistic labels when training a downstream model that supports soft labels. Generate discrete labels when you need hard assignments for standard supervised learning or evaluation.
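The workflow above can be sketched end to end in NumPy. This assumes a trained label model has already produced the posterior probabilities; the label matrix, probabilities, and the convention that fully-abstained rows are dropped before downstream training are illustrative:

```python
import numpy as np

# Label matrix: rows are data points, columns are LFs; -1 = abstain.
L = np.array([[0, -1, 0], [-1, -1, -1], [1, 0, 1]])
# Hypothetical probabilistic labels from a trained label model.
probs = np.array([[0.85, 0.15], [0.5, 0.5], [0.1, 0.9]])

# Keep only data points labeled by at least one LF; fully-abstained
# rows carry no signal and are typically filtered out before training.
covered = (L != -1).any(axis=1)
soft_labels = probs[covered]                 # for noise-aware training
hard_labels = probs[covered].argmax(axis=1)  # for standard pipelines
```

In Snorkel itself, the soft and hard outputs correspond to the label model's `predict_proba` and `predict` methods, respectively.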
Theoretical Basis
Given trained label-model parameters $\hat{\theta}$ and a label matrix $\Lambda$, the posterior probability of the true label $Y$ is obtained by normalizing the joint distribution over the label classes:

$$P_{\hat{\theta}}(Y = y \mid \Lambda) = \frac{P_{\hat{\theta}}(\Lambda, Y = y)}{\sum_{y'} P_{\hat{\theta}}(\Lambda, Y = y')}$$
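Under the simplifying assumption of conditionally independent LFs with known accuracies (a toy stand-in for the full learned label-model parameters), this posterior can be computed per data point as follows; the function and parameter names are illustrative:

```python
import numpy as np

def posterior(lf_votes, accuracies, prior, abstain=-1):
    """P(Y = y | votes) for one data point, assuming conditionally
    independent LFs, each correct with probability `accuracies[j]`
    and spreading its errors uniformly over the other classes."""
    k = len(prior)
    log_p = np.log(np.asarray(prior, dtype=float))
    for vote, acc in zip(lf_votes, accuracies):
        if vote == abstain:
            continue  # abstentions contribute no evidence in this model
        for y in range(k):
            log_p[y] += np.log(acc if vote == y else (1 - acc) / (k - 1))
    p = np.exp(log_p - log_p.max())  # subtract max for numerical stability
    return p / p.sum()

# Two LFs vote for class 0; the third abstains.
p = posterior([0, 0, -1], accuracies=[0.8, 0.7, 0.9], prior=[0.5, 0.5])
```

The real label model additionally learns the accuracy parameters from the observed agreement and disagreement patterns rather than taking them as given.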
For discrete predictions with tie-breaking:
- Abstain: Return -1 if max probabilities are tied
- Random: Break ties pseudo-randomly but deterministically (e.g., via a hash of the data point), so results are reproducible across runs
- True-random: Break ties with genuine randomness
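The three tie-breaking policies above can be sketched as a single prediction function; the policy names mirror the list, but the hashing scheme and function signature are illustrative:

```python
import numpy as np

ABSTAIN = -1

def predict_with_ties(probs, policy="abstain", seed=None):
    """Argmax prediction with configurable tie-breaking."""
    probs = np.asarray(probs, dtype=float)
    preds = probs.argmax(axis=1)
    # A row is tied if more than one class attains the max probability.
    ties = np.isclose(probs, probs.max(axis=1, keepdims=True)).sum(axis=1) > 1
    if policy == "abstain":
        preds[ties] = ABSTAIN
    elif policy == "random":
        # Deterministic: hash the row index to pick among tied classes,
        # so repeated runs give identical predictions.
        for i in np.flatnonzero(ties):
            tied = np.flatnonzero(np.isclose(probs[i], probs[i].max()))
            preds[i] = tied[hash(int(i)) % len(tied)]
    elif policy == "true-random":
        rng = np.random.default_rng(seed)
        for i in np.flatnonzero(ties):
            tied = np.flatnonzero(np.isclose(probs[i], probs[i].max()))
            preds[i] = rng.choice(tied)
    return preds
```

The abstain policy is the safest default for evaluation, since silently assigning a class to a tied data point can mask label-model uncertainty.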