Principle:Snorkel team Snorkel Synthetic Label Matrix Generation
| Knowledge Sources | |
|---|---|
| Domains | Weak_Supervision, Testing, Data_Generation |
| Last Updated | 2026-02-14 20:38 GMT |
Overview
Technique for generating synthetic label matrices from parameterized labeling function accuracy models to enable controlled testing of weak supervision algorithms.
Description
Synthetic Label Matrix Generation creates a complete simulated weak supervision environment by sampling from a generative model of labeling functions. Rather than requiring real labeling functions and data, this technique defines each LF through a conditional probability table P(LF=l | Y=y) and generates label matrices by ancestral sampling. The generated data has known ground-truth parameters (LF accuracies, true labels), enabling precise validation of whether a label model can recover these parameters. This is essential for testing the correctness and convergence properties of generative label models like Snorkel's LabelModel.
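To make the validation idea concrete, here is a minimal NumPy sketch (not Snorkel's own code). It assumes a single binary LF with a hand-picked conditional table, samples votes by ancestral sampling, and then re-estimates the table from the samples. The table values and the names `true_table` and `estimate_table` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical table for one binary LF: rows = (abstain, votes 0, votes 1),
# columns = true label y. Each column is a distribution P(LF = l | Y = y).
true_table = np.array([
    [0.5, 0.4],  # abstain
    [0.4, 0.1],  # votes 0
    [0.1, 0.5],  # votes 1
])

n = 10_000
Y = rng.integers(0, 2, size=n)  # true labels, sampled uniformly

# Ancestral sampling: draw Y first, then the LF output given Y.
# Subtracting 1 maps row 0 to -1 (abstain).
votes = np.array([rng.choice(3, p=true_table[:, y]) for y in Y]) - 1


def estimate_table(votes, Y, cardinality):
    """Empirical P(LF = l | Y = y) from sampled votes and true labels."""
    est = np.zeros((cardinality + 1, cardinality))
    for y in range(cardinality):
        mask = Y == y
        for l in range(-1, cardinality):
            est[l + 1, y] = np.mean(votes[mask] == l)
    return est


estimated = estimate_table(votes, Y, cardinality=2)
max_error = np.abs(estimated - true_table).max()
```

Because the true table is known, `max_error` gives a direct measure of how well an estimator recovers the parameters. A label model under test would take the place of `estimate_table`, with the important difference that it does not get to see `Y`.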
Usage
Use this principle when designing test suites or benchmarks for weak supervision algorithms. It is the correct approach when you need to verify that a label model correctly estimates LF accuracies, when you want to test edge cases (high abstain rates, adversarial LFs, multi-class settings), or when you need reproducible experiments without real-world data dependencies.
Theoretical Basis
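The generative model follows the data programming framework. True labels are drawn uniformly, and each labeling function output is then drawn conditionally on the true label:

$$Y \sim \mathrm{Uniform}\{0, \dots, k - 1\}, \qquad P(\mathrm{LF}_j = l \mid Y = y) = P_j[l + 1,\, y], \quad l \in \{-1, 0, \dots, k - 1\}$$

Where $P_j$ is the conditional probability table for labeling function $j$, with dimensions (cardinality + 1) x cardinality, and $k$ is the cardinality. The extra row (index 0, corresponding to $l = -1$) accounts for abstains, and each column of $P_j$ sums to 1.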
Key properties of the generation process:
- Non-adversarial bias: Within the non-abstain rows of each table, the diagonal entries (LF output equals true label) are boosted by (cardinality - 1) before normalization, ensuring LFs are more likely to output the correct label than any specific incorrect label.
- Configurable sparsity: The abstain probability row is scaled by an abstain_multiplier, simulating the common real-world pattern where LFs label sparsely.
- Balanced classes: True labels Y are sampled uniformly, providing a controlled baseline.
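The first two properties can be checked directly on a constructed table. The following is a small sketch under stated assumptions: `build_lf_table` is a hypothetical helper, the raw entries are assumed to be uniform draws in [0, 1), and the multiplier values 5.0 and 0.5 are arbitrary.

```python
import numpy as np


def build_lf_table(cardinality, abstain_multiplier, rng):
    """Build one (cardinality + 1) x cardinality table; row 0 is abstain."""
    table = rng.random((cardinality + 1, cardinality))
    table[1:, :] += (cardinality - 1) * np.eye(cardinality)  # boost correct label
    table[0, :] *= abstain_multiplier                        # scale abstain row
    return table / table.sum(axis=0, keepdims=True)          # columns sum to 1


cardinality = 3
# Same seed for both tables, so only the multiplier differs.
sparse = build_lf_table(cardinality, 5.0, np.random.default_rng(1))
dense = build_lf_table(cardinality, 0.5, np.random.default_rng(1))

# Non-adversarial bias: in each column, the correct label has the largest
# probability among the non-abstain outputs.
correct_is_argmax = bool(
    (sparse[1:, :].argmax(axis=0) == np.arange(cardinality)).all()
)

# Configurable sparsity: a larger multiplier puts more mass on abstain.
abstain_sparse = sparse[0, :].mean()
abstain_dense = dense[0, :].mean()
```

With uniform draws in [0, 1), a boosted diagonal entry is at least (cardinality - 1) and every off-diagonal entry is below 1, so the ordering holds for any cardinality of 2 or more. Normalizing each column preserves that ordering.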
Pseudo-code Logic:
# Abstract generation algorithm
for each LF j:
    P[j] = random_table(cardinality + 1, cardinality)
    P[j][1:, :] += (cardinality - 1) * identity  # non-adversarial bias
    P[j][0, :] *= abstain_multiplier             # abstain scaling
    P[j] = normalize_columns(P[j])
Y = uniform_sample(cardinality, n)
for each data point i, LF j:
    L[i, j] = sample(P[j][:, Y[i]]) - 1          # shift so -1 = abstain
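The pseudo-code above can be turned into a runnable NumPy sketch along the following lines. This is an illustration, not necessarily how Snorkel implements it: the function name `generate_label_matrix`, the `seed` parameter, and the default `abstain_multiplier=3.0` are assumptions, and `random_table` is assumed to mean uniform draws in [0, 1).

```python
import numpy as np


def generate_label_matrix(n, m, cardinality, abstain_multiplier=3.0, seed=0):
    """Sketch of the generation algorithm; returns (P, Y, L).

    P: (m, cardinality + 1, cardinality) conditional tables, row 0 = abstain
    Y: (n,) true labels in {0, ..., cardinality - 1}
    L: (n, m) label matrix with -1 = abstain
    """
    rng = np.random.default_rng(seed)

    # One conditional probability table per labeling function.
    P = np.empty((m, cardinality + 1, cardinality))
    for j in range(m):
        table = rng.random((cardinality + 1, cardinality))
        table[1:, :] += (cardinality - 1) * np.eye(cardinality)  # non-adversarial bias
        table[0, :] *= abstain_multiplier                        # abstain scaling
        P[j] = table / table.sum(axis=0, keepdims=True)          # normalize columns

    # Balanced true labels.
    Y = rng.integers(0, cardinality, size=n)

    # Ancestral sampling of each LF output given the true label.
    L = np.empty((n, m), dtype=int)
    for i in range(n):
        for j in range(m):
            L[i, j] = rng.choice(cardinality + 1, p=P[j, :, Y[i]]) - 1
    return P, Y, L


P, Y, L = generate_label_matrix(n=500, m=4, cardinality=2)
coverage = (L != -1).mean(axis=0)  # fraction of points each LF labels
```

Fixing the seed makes the output reproducible, which suits test suites. The returned `P` and `Y` serve as the ground truth against which a label model's estimates from `L` can be compared.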