Implementation: Snorkel generate_simple_label_matrix
| Knowledge Sources | Details |
|---|---|
| Domains | Weak_Supervision, Testing, Data_Generation |
| Last Updated | 2026-02-14 20:38 GMT |
Overview
Concrete tool for generating synthetic label matrices with known ground-truth parameters for testing and benchmarking weak supervision algorithms.
Description
The generate_simple_label_matrix function creates a complete synthetic weak supervision scenario: labeling function conditional probability tables, true labels, and the resulting label matrix. It models each labeling function as a noisy voter with a conditional probability table P(LF=l | Y=y), biases LFs towards being non-adversarial (correct more often than wrong), and supports configurable abstain rates. This is the primary testing utility for validating label model convergence and correctness without requiring real labeling functions or data.
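The noisy-voter scheme described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation for intuition, not the library source; the bias constant and normalization details are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, cardinality = 100, 3, 2

# One conditional probability table per LF: rows index the LF output
# (0 = abstain, 1..cardinality = class votes), columns index the true class y.
P = np.empty((m, cardinality + 1, cardinality))
for j in range(m):
    p = rng.random((cardinality + 1, cardinality))
    # Bias toward non-adversarial LFs: boost P(LF votes y | Y=y)
    p[1:, :] += np.eye(cardinality)
    P[j] = p / p.sum(axis=0)  # normalize each column into a distribution

Y = rng.integers(0, cardinality, size=n)  # true labels, sampled uniformly
L = np.empty((n, m), dtype=int)
for i in range(n):
    for j in range(m):
        # Sample the LF output from its CPT column for Y[i]; shift so -1 = abstain
        L[i, j] = rng.choice(cardinality + 1, p=P[j, :, Y[i]]) - 1
```

Because each LF's correct-vote probability is boosted before normalization, the resulting label matrix has LFs that are correct more often than wrong, which is what lets a label model recover their accuracies.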
Usage
Import this function when you need synthetic data for unit testing label model algorithms, benchmarking weak supervision approaches, or running reproducibility experiments. It is used extensively in Snorkel's own test suite to validate that the LabelModel correctly recovers latent accuracy parameters.
Code Reference
Source Location
- Repository: Snorkel
- File: snorkel/synthetic/synthetic_data.py
- Lines: 1-59
Signature
def generate_simple_label_matrix(
n: int, m: int, cardinality: int, abstain_multiplier: float = 1.0
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
"""Generate a synthetic label matrix with true parameters and labels.
This function generates a set of labeling function conditional probability tables,
P(LF=l | Y=y), stored as a matrix P, and true labels Y, and then generates the
resulting label matrix L.
Parameters
----------
n
Number of data points
m
Number of labeling functions
cardinality
Cardinality of true labels (i.e. not including abstains)
abstain_multiplier
Factor to multiply the probability of abstaining by
Returns
-------
Tuple[np.ndarray, np.ndarray, np.ndarray]
A tuple containing the LF conditional probabilities P,
the true labels Y, and the output label matrix L
"""
Import
from snorkel.synthetic.synthetic_data import generate_simple_label_matrix
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| n | int | Yes | Number of data points to generate |
| m | int | Yes | Number of labeling functions to simulate |
| cardinality | int | Yes | Number of true label classes (excluding abstain) |
| abstain_multiplier | float | No | Multiplier for abstain probability (default 1.0); higher values produce sparser label matrices |
Outputs
| Name | Type | Description |
|---|---|---|
| P | np.ndarray (m, cardinality+1, cardinality) | Conditional probability tables P(LF=l \| Y=y) for each LF, where l=0 is abstain |
| Y | np.ndarray (n,) | True labels for each data point, sampled uniformly from [0, cardinality) |
| L | np.ndarray (n, m) | Label matrix where L[i, j] is the label assigned by LF j to data point i; -1 indicates abstain |
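Given a CPT slice from P, each LF's expected accuracy conditioned on voting follows directly; this quantity is what convergence tests compare against the LabelModel's learned parameters. A sketch with a hand-built CPT for one LF (the numeric values are illustrative assumptions, not library output):

```python
import numpy as np

cardinality = 2
# Hand-built CPT for a single LF, shaped (cardinality+1, cardinality):
# row 0 is abstain, rows 1..cardinality are class votes; columns are true y.
P_j = np.array([
    [0.2, 0.2],   # P(abstain      | Y=y)
    [0.7, 0.1],   # P(vote class 0 | Y=y)
    [0.1, 0.7],   # P(vote class 1 | Y=y)
])

# With Y uniform, P(correct | Y=y) is the diagonal of the vote rows,
# and P(vote | Y=y) is one minus the abstain row.
p_correct = np.diag(P_j[1:, :])   # [0.7, 0.7]
p_vote = 1.0 - P_j[0, :]          # [0.8, 0.8]
acc_given_vote = (p_correct / p_vote).mean()
print(f"Accuracy conditioned on voting: {acc_given_vote:.3f}")  # 0.875
```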
Usage Examples
Basic Synthetic Label Matrix
from snorkel.synthetic.synthetic_data import generate_simple_label_matrix
# Generate a binary classification scenario
# 1000 data points, 10 labeling functions, 2 classes
P, Y, L = generate_simple_label_matrix(n=1000, m=10, cardinality=2)
print(f"Conditional probabilities shape: {P.shape}") # (10, 3, 2)
print(f"True labels shape: {Y.shape}") # (1000,)
print(f"Label matrix shape: {L.shape}") # (1000, 10)
print(f"Abstain rate: {(L == -1).mean():.2f}")
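With Y and L in hand, per-LF empirical accuracy on non-abstain votes is a one-liner per column. A self-contained sketch using small toy arrays in place of the function's output:

```python
import numpy as np

# Toy stand-ins for the function's outputs: Y has shape (n,), L has shape (n, m)
Y = np.array([0, 1, 0, 1, 1])
L = np.array([
    [0, -1],
    [1,  1],
    [0,  0],
    [1, -1],
    [0,  1],
])

n, m = L.shape
for j in range(m):
    voted = L[:, j] != -1                      # mask out abstains
    acc = (L[voted, j] == Y[voted]).mean()
    print(f"LF {j}: abstain rate {(~voted).mean():.2f}, accuracy {acc:.2f}")
```

Comparing these empirical accuracies to the diagonals of P is a quick sanity check that the sampling behaved as expected.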
Testing LabelModel Convergence
import numpy as np
from snorkel.synthetic.synthetic_data import generate_simple_label_matrix
from snorkel.labeling.model import LabelModel
# Generate sparse label matrix (high abstain rate)
P, Y, L = generate_simple_label_matrix(
n=5000, m=10, cardinality=2, abstain_multiplier=3.0
)
# Train label model on synthetic data
label_model = LabelModel(cardinality=2)
label_model.fit(L, n_epochs=500)
# Evaluate against known ground truth
predictions = label_model.predict(L)
accuracy = (predictions == Y).mean()
print(f"Label model accuracy: {accuracy:.3f}")
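When benchmarking against known ground truth, a useful baseline is per-point majority vote over non-abstain labels; the LabelModel should match or beat it. A pure-NumPy sketch with toy data (ties are broken toward the lowest class, an arbitrary choice):

```python
import numpy as np

def majority_vote(L, cardinality):
    """Predict the most common non-abstain vote per row (ties -> lowest class)."""
    preds = np.empty(L.shape[0], dtype=int)
    for i, row in enumerate(L):
        votes = row[row != -1]
        if votes.size == 0:
            preds[i] = 0  # arbitrary fallback when every LF abstains
        else:
            preds[i] = np.bincount(votes, minlength=cardinality).argmax()
    return preds

# Toy stand-ins for the function's outputs
Y = np.array([0, 1, 1, 0])
L = np.array([
    [ 0,  0, -1],
    [ 1,  1,  0],
    [-1,  1,  1],
    [ 0, -1,  1],
])
preds = majority_vote(L, cardinality=2)
print(f"Majority-vote accuracy: {(preds == Y).mean():.2f}")
```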