Principle: Snorkel Generative Label Model Training
| Knowledge Sources | |
|---|---|
| Domains | Weak_Supervision, Graphical_Models, Matrix_Completion |
| Last Updated | 2026-02-14 20:00 GMT |
Overview
An algorithm that learns the accuracy parameters of noisy labeling functions from their agreement and disagreement patterns, without access to ground truth labels.
Description
Generative Label Model Training is the core algorithmic step in the data programming paradigm. Given a label matrix produced by multiple noisy labeling functions (LFs), the label model learns the conditional probability of each LF's output given the true (unobserved) label: $\mu[j, y, l] = P(\lambda_j = l \mid Y = y)$.
The key insight is that the agreement and disagreement patterns among labeling functions provide sufficient statistics to estimate their individual accuracies. This is possible because the LFs are assumed to be conditionally independent given the true label Y (or have a known dependency structure).
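To make this insight concrete, here is a hypothetical simulation (not Snorkel code): three conditionally independent LFs vote in $\{-1, +1\}$, and their individual accuracies are recovered from pairwise agreement moments alone, using the classic method-of-moments identity $E[\lambda_i \lambda_j] = a_i a_j$ with $a_i = 2 \cdot \mathrm{acc}_i - 1$.

```python
import numpy as np

# Hypothetical setup: three conditionally independent LFs voting in {-1, +1},
# each agreeing with the true label Y with its own (unknown) accuracy.
rng = np.random.default_rng(0)
n = 200_000
true_acc = np.array([0.85, 0.75, 0.65])

Y = rng.choice([-1, 1], size=n)
correct = rng.random((n, 3)) < true_acc        # True -> LF agrees with Y
L = np.where(correct, Y[:, None], -Y[:, None])

# Under conditional independence, E[l_i * l_j] = a_i * a_j with
# a_i = 2 * acc_i - 1, so each a_i follows from pairwise moments alone.
M = (L.T @ L) / n
a = np.array([
    np.sqrt(M[0, 1] * M[0, 2] / M[1, 2]),
    np.sqrt(M[0, 1] * M[1, 2] / M[0, 2]),
    np.sqrt(M[0, 2] * M[1, 2] / M[0, 1]),
])
est_acc = (a + 1) / 2  # close to true_acc, estimated without ever seeing Y
```

Note that only second moments of the votes enter the estimate; the ground truth `Y` is used solely to generate the synthetic data.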
The Snorkel label model uses a matrix completion approach over the junction tree of the LF dependency graph. It computes the inverse generalized covariance matrix of the augmented label matrix and solves an optimization problem over its structure to recover the conditional LF probability parameters $\mu$.
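The covariance structure this step exploits can be illustrated with a small simulation (a hypothetical setup, not Snorkel's implementation): for conditionally independent LFs, the inverse covariance of the LF votes jointly with $Y$ is graph-structured, so its LF–LF off-diagonal entries are zero in the population. That sparsity is what the matrix completion step leans on when $Y$ is unobserved.

```python
import numpy as np

# Hypothetical simulation: binary LFs in {-1, +1}, star-shaped dependency
# graph (each LF depends only on Y). The inverse covariance of
# (lambda_1, lambda_2, lambda_3, Y) is then graph-structured.
rng = np.random.default_rng(1)
n = 200_000
acc = np.array([0.8, 0.7, 0.6])

Y = rng.choice([-1, 1], size=n)
correct = rng.random((n, 3)) < acc
lam = np.where(correct, Y[:, None], -Y[:, None])

K = np.cov(np.column_stack([lam, Y]), rowvar=False)
K_inv = np.linalg.inv(K)
# LF-LF entries of K_inv are ~0, while the LF-Y entries are large in
# magnitude; when Y is unobserved, that missing row/column is what the
# matrix completion formulation recovers.
```

This mirrors the Gaussian graphical model intuition; for tree-structured discrete models the analogous statement holds for the generalized covariance matrix.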
Training involves:
- Computing the augmented label matrix (one-hot encoded votes)
- Building a clique tree for the dependency structure
- Optimizing a noise-aware loss function using SGD/Adam
- Optionally aligning label classes using the Munkres algorithm
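The first step above can be sketched as follows (a minimal stand-in: the `augment_label_matrix` name and the abstain encoding are assumptions, and Snorkel's internal encoding differs in detail):

```python
import numpy as np

def augment_label_matrix(L, cardinality):
    """One-hot encode LF votes into an augmented label matrix.

    L: (n, m) int matrix; classes are 0..cardinality-1, -1 means abstain.
    Returns an (n, m * cardinality) matrix in which an abstaining LF
    contributes an all-zero block.
    """
    n, m = L.shape
    L_aug = np.zeros((n, m * cardinality))
    rows, cols = np.nonzero(L != -1)
    L_aug[rows, cols * cardinality + L[rows, cols]] = 1.0
    return L_aug

L = np.array([[0, 1, -1],
              [1, 1, 0]])
A = augment_label_matrix(L, cardinality=2)
# A[0] -> [1, 0, 0, 1, 0, 0]: LF0 voted 0, LF1 voted 1, LF2 abstained
```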
Usage
Use this principle after applying labeling functions and analyzing their quality. Train the label model when you have a sufficient set of labeling functions (typically 3+ with reasonable coverage) and want to combine their noisy votes into high-quality probabilistic labels.
Theoretical Basis
The generative model defines the joint distribution:

$$P_\mu(\boldsymbol{\lambda}, Y) = P(Y) \prod_{j=1}^{m} P_\mu(\lambda_j \mid Y)$$

under the conditional independence assumption. The label model parameters encode:

$$\mu[j, y, l] = P(\lambda_j = l \mid Y = y)$$

Training minimizes the negative log marginal likelihood of the observed label matrix:

$$\hat{\mu} = \arg\min_\mu \; -\sum_{i=1}^{n} \log \sum_{y} P_\mu(\boldsymbol{\lambda}^{(i)}, Y = y)$$

with optional L2 regularization and LF precision priors.
Pseudo-code:
```python
# Abstract label model training
L_aug = augment_label_matrix(L_train)  # one-hot encode votes once
mu = initialize_parameters(n_lfs, cardinality, prec_init=0.7)
for epoch in range(n_epochs):
    loss = compute_loss(L_aug, mu) + l2 * regularization(mu)
    mu = optimizer_step(mu, loss)
```
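Snorkel itself optimizes this objective with SGD/Adam over the matrix completion formulation; as a self-contained illustration of minimizing the same negative log marginal likelihood under conditional independence, here is an EM-style sketch (the function name and all details are hypothetical, not Snorkel's API):

```python
import numpy as np

def train_label_model_em(L, cardinality=2, n_epochs=100, prec_init=0.7):
    """EM sketch of generative label model training.

    L: (n, m) int matrix of LF votes; -1 = abstain, classes 0..k-1.
    Returns mu[j, y, l] ~ P(LF_j emits slot l | Y = y), slot 0 = abstain,
    and the per-example posteriors P(Y | lambda).
    """
    n, m = L.shape
    k = cardinality
    # Augmented (one-hot) label matrix with an explicit abstain slot
    L_aug = np.zeros((n, m, k + 1))
    L_aug[np.arange(n)[:, None], np.arange(m)[None, :], L + 1] = 1.0
    # Init: when an LF votes, assume precision prec_init (breaks symmetry)
    coverage = (L != -1).mean(axis=0)
    mu = np.zeros((m, k, k + 1))
    for y in range(k):
        mu[:, y, 0] = 1.0 - coverage
        for l in range(k):
            p = prec_init if l == y else (1.0 - prec_init) / max(k - 1, 1)
            mu[:, y, l + 1] = coverage * p
    prior = np.full(k, 1.0 / k)
    for _ in range(n_epochs):
        # E-step: P(Y = y | lambda) under conditional independence
        log_post = np.log(prior) + np.einsum(
            "nml,myl->ny", L_aug, np.log(mu + 1e-12))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: expected-count updates for mu and the class prior
        mu = np.einsum("ny,nml->myl", post, L_aug) + 1e-12
        mu /= mu.sum(axis=2, keepdims=True)
        prior = post.mean(axis=0)
    return mu, post
```

On synthetic votes from three LFs with distinct accuracies, the recovered `mu` approximates the true per-LF accuracies and `post.argmax(axis=1)` yields probabilistic labels that beat the weakest LFs, without ever seeing ground truth.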