Implementation:Snorkel team Snorkel Generate Simple Label Matrix

From Leeroopedia
Knowledge Sources
Domains Weak_Supervision, Testing, Data_Generation
Last Updated 2026-02-14 20:38 GMT

Overview

Concrete tool for generating synthetic label matrices with known ground-truth parameters for testing and benchmarking weak supervision algorithms.

Description

The generate_simple_label_matrix function creates a complete synthetic weak supervision scenario: labeling function conditional probability tables, true labels, and the resulting label matrix. It models each labeling function as a noisy voter with a conditional probability table P(LF=l | Y=y), biases LFs towards being non-adversarial (correct more often than wrong), and supports configurable abstain rates. This is the primary testing utility for validating label model convergence and correctness without requiring real labeling functions or data.
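The generative process described above can be sketched in a few lines. This is an illustrative sketch using only the standard library, not Snorkel's implementation: the helper names `sketch_cpt` and `sketch_vote` are hypothetical, and the specific bias (adding `cardinality - 1` to the entry where the vote matches the true label) is an assumption chosen so that the correct label always outweighs each wrong one.

```python
import random


def sketch_cpt(cardinality, abstain_multiplier=1.0, rng=None):
    """Hypothetical helper: one LF's table, rows l (0 = abstain) by columns y."""
    rng = rng or random.Random()
    # Start from uniform noise in [0, 1) for every (l, y) entry.
    table = [
        [rng.random() for _ in range(cardinality)]
        for _ in range(cardinality + 1)
    ]
    # Assumed bias: add weight where the emitted label matches the true label.
    for y in range(cardinality):
        table[y + 1][y] += cardinality - 1
    # Scale the abstain row, then normalize each column into a distribution.
    table[0] = [v * abstain_multiplier for v in table[0]]
    for y in range(cardinality):
        total = sum(table[l][y] for l in range(cardinality + 1))
        for l in range(cardinality + 1):
            table[l][y] /= total
    return table


def sketch_vote(table, y, rng):
    """Sample one LF output for true label y; -1 means abstain."""
    column = [row[y] for row in table]
    return rng.choices(range(len(column)), weights=column)[0] - 1


rng = random.Random(0)
table = sketch_cpt(cardinality=2, abstain_multiplier=1.0, rng=rng)
votes = [sketch_vote(table, y=1, rng=rng) for _ in range(5)]
```

Raising `abstain_multiplier` in this sketch shifts probability mass into row 0 of every column, so more sampled votes come back as -1.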

Usage

Import this function when you need synthetic data for unit testing label model algorithms, benchmarking weak supervision approaches, or running reproducibility experiments. It is used extensively in Snorkel's own test suite to validate that the LabelModel correctly recovers latent accuracy parameters.

Code Reference

Source Location

Signature

def generate_simple_label_matrix(
    n: int, m: int, cardinality: int, abstain_multiplier: float = 1.0
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Generate a synthetic label matrix with true parameters and labels.

    This function generates a set of labeling function conditional probability tables,
    P(LF=l | Y=y), stored as a matrix P, and true labels Y, and then generates the
    resulting label matrix L.

    Parameters
    ----------
    n
        Number of data points
    m
        Number of labeling functions
    cardinality
        Cardinality of true labels (i.e. not including abstains)
    abstain_multiplier
        Factor to multiply the probability of abstaining by

    Returns
    -------
    Tuple[np.ndarray, np.ndarray, np.ndarray]
        A tuple containing the LF conditional probabilities P,
        the true labels Y, and the output label matrix L
    """

Import

from snorkel.synthetic.synthetic_data import generate_simple_label_matrix

I/O Contract

Inputs

Name | Type | Required | Description
--- | --- | --- | ---
n | int | Yes | Number of data points to generate
m | int | Yes | Number of labeling functions to simulate
cardinality | int | Yes | Number of true label classes (excluding abstain)
abstain_multiplier | float | No | Multiplier for abstain probability (default 1.0); higher values produce sparser label matrices

Outputs

Name | Type | Description
--- | --- | ---
P | np.ndarray (m, cardinality+1, cardinality) | LF conditional probability tables P(LF=l | Y=y), where l=0 is abstain
Y | np.ndarray (n,) | True labels for each data point, sampled uniformly from [0, cardinality)
L | np.ndarray (n, m) | Label matrix where L[i, j] is the label assigned by LF j to data point i; -1 indicates abstain
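Note the offset between the two encodings: abstain is index 0 along the second axis of P but -1 in L, so a vote v in L corresponds to row v + 1 of P. The sketch below illustrates that lookup on a hand-written stand-in for P (nested lists with m=1 and cardinality=2; the numbers are made up for illustration). The helper names are hypothetical.

```python
# Hand-written stand-in for P, indexed as P[j][l][y]; row l=0 is abstain.
P = [
    [
        [0.2, 0.4],  # P(LF abstains | Y=0), P(LF abstains | Y=1)
        [0.6, 0.1],  # P(LF votes 0  | Y=0), P(LF votes 0  | Y=1)
        [0.2, 0.5],  # P(LF votes 1  | Y=0), P(LF votes 1  | Y=1)
    ]
]


def prob_of_vote(P, j, vote, y):
    """Probability that LF j emits `vote` (-1 = abstain) when the true label is y."""
    return P[j][vote + 1][y]


def expected_abstain_rate(P, j):
    """Abstain rate of LF j, assuming true labels are uniform over the classes."""
    abstain_row = P[j][0]
    return sum(abstain_row) / len(abstain_row)


def accuracy_when_voting(P, j, y):
    """P(LF j is correct | it does not abstain, Y=y)."""
    return P[j][y + 1][y] / (1.0 - P[j][0][y])


abstain_rate = expected_abstain_rate(P, 0)     # (0.2 + 0.4) / 2
accuracy_y0 = accuracy_when_voting(P, 0, y=0)  # 0.6 / 0.8
```

The same indexing applies to the NumPy array returned by the function, for example `P[j, v + 1, y]`.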

Usage Examples

Basic Synthetic Label Matrix

from snorkel.synthetic.synthetic_data import generate_simple_label_matrix

# Generate a binary classification scenario
# 1000 data points, 10 labeling functions, 2 classes
P, Y, L = generate_simple_label_matrix(n=1000, m=10, cardinality=2)

print(f"Conditional probabilities shape: {P.shape}")  # (10, 3, 2)
print(f"True labels shape: {Y.shape}")                 # (1000,)
print(f"Label matrix shape: {L.shape}")                # (1000, 10)
print(f"Abstain rate: {(L == -1).mean():.2f}")

Testing LabelModel Convergence

import numpy as np
from snorkel.synthetic.synthetic_data import generate_simple_label_matrix
from snorkel.labeling.model import LabelModel

# Seed NumPy's global RNG so the synthetic data is reproducible
# (assumes the generator samples via np.random)
np.random.seed(123)

# Generate sparse label matrix (high abstain rate)
P, Y, L = generate_simple_label_matrix(
    n=5000, m=10, cardinality=2, abstain_multiplier=3.0
)

# Train label model on synthetic data
label_model = LabelModel(cardinality=2)
label_model.fit(L, n_epochs=500)

# Evaluate against known ground truth
predictions = label_model.predict(L)
accuracy = (predictions == Y).mean()
print(f"Label model accuracy: {accuracy:.3f}")
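One caveat when scoring against Y: with its default tie-break policy, `LabelModel.predict` can return -1 for points where the class probabilities tie, and a plain `(predictions == Y).mean()` counts those as errors. A minimal sketch of a scoring helper that reports coverage separately is shown below. The name `score_predictions` is hypothetical, the inputs are hand-written rather than real model output, and the helper assumes a non-empty prediction list.

```python
def score_predictions(predictions, truth, abstain=-1):
    """Return (coverage, accuracy on covered points, accuracy with abstains as wrong)."""
    pairs = list(zip(predictions, truth))
    # Keep only the points where the model committed to a label.
    covered = [(p, y) for p, y in pairs if p != abstain]
    correct = sum(1 for p, y in covered if p == y)
    coverage = len(covered) / len(pairs)
    covered_accuracy = correct / len(covered) if covered else 0.0
    overall_accuracy = correct / len(pairs)
    return coverage, covered_accuracy, overall_accuracy


# Tiny hand-written example, not output from a real model.
predictions = [0, 1, -1, 1, 0, -1, 1, 0]
truth = [0, 1, 1, 0, 0, 0, 1, 1]

coverage, covered_accuracy, overall_accuracy = score_predictions(predictions, truth)
print(f"coverage={coverage:.2f}, covered accuracy={covered_accuracy:.2f}, "
      f"overall accuracy={overall_accuracy:.2f}")
```

A large gap between the last two numbers suggests that abstentions, rather than wrong labels, are driving the headline accuracy down.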
