Workflow:Snorkel team Snorkel Weak Supervision Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Weak_Supervision, Data_Labeling, Machine_Learning |
| Last Updated | 2026-02-14 20:00 GMT |
Overview
End-to-end process for programmatically labeling training data using labeling functions and a generative label model to produce high-quality probabilistic labels without manual annotation.
Description
This workflow implements the core Snorkel weak supervision paradigm. Users define heuristic labeling functions (LFs) that encode domain knowledge as simple Python functions. These LFs are applied to unlabeled data to produce a label matrix where each LF votes on each data point (or abstains). The LabelModel, a generative model based on matrix completion over a clique tree structure, learns the accuracy parameters of each LF by analyzing their agreements and disagreements. It then combines the noisy LF votes into high-quality probabilistic labels that can be used to train any downstream discriminative model.
The workflow covers the full pipeline: LF definition, LF application, analysis of LF quality via coverage, overlap, and conflict statistics, label model training, probabilistic label generation, and evaluation against gold labels.
Usage
Execute this workflow when you have unlabeled data and domain experts who can express labeling heuristics as simple rules, patterns, or functions. This is the right approach when manual labeling is too expensive or slow, but you can identify noisy signals such as keyword patterns, regular expressions, distant supervision from knowledge bases, or heuristic rules that correlate with the target label. The output is a set of probabilistic training labels suitable for training any supervised model.
Execution Steps
Step 1: Define Labeling Functions
Create a set of labeling functions that encode domain knowledge as heuristics. Each LF is a Python function that takes a single data point and returns either an integer class label or -1 to abstain. Simple LFs can be declared with the @labeling_function decorator; LFs that are generated programmatically or need more configuration can be instantiated directly as LabelingFunction objects. Both forms accept resources and preprocessors.
Key considerations:
- Each LF should target a specific signal or pattern in the data
- LFs should have meaningful names for analysis and debugging
- Use the abstain mechanism (-1) when the LF has no opinion on a data point
- Attach preprocessors (e.g., SpacyPreprocessor for NLP tasks) via the pre parameter
- External resources like dictionaries or lookup tables are passed via the resources parameter
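As a minimal sketch of what such LFs look like, the plain Python functions below encode three heuristics for a hypothetical spam-detection task. The task, the label constants, the function names, and the keyword set are all illustrative assumptions. In Snorkel each function would additionally be wrapped with the labeling_function decorator or a LabelingFunction object; that wrapping is omitted here so the snippet runs with the standard library alone.

```python
import re
from types import SimpleNamespace

# Hypothetical label scheme for a binary spam task; -1 is the abstain value.
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(x):
    """Vote SPAM when the text contains a URL, otherwise abstain."""
    return SPAM if re.search(r"https?://", x.text, flags=re.IGNORECASE) else ABSTAIN


def lf_short_comment(x):
    """Vote HAM for very short comments (fewer than 5 words), otherwise abstain."""
    return HAM if len(x.text.split()) < 5 else ABSTAIN


def lf_spam_keyword(x, keywords):
    """Vote SPAM when any word from an external keyword set appears in the text."""
    words = set(x.text.lower().split())
    return SPAM if words & keywords else ABSTAIN


# A data point only needs attribute access, like a DataFrame row.
point = SimpleNamespace(text="Subscribe now at http://example.com for more videos")
votes = [
    lf_contains_link(point),
    lf_short_comment(point),
    lf_spam_keyword(point, keywords={"subscribe", "giveaway"}),
]
```

The third function receives its keyword set as an argument, which mirrors the role of the resources parameter: the lookup data lives outside the function body and can be swapped without touching the heuristic.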
Step 2: Apply Labeling Functions
Execute all labeling functions across the dataset to produce a label matrix L of shape [n_datapoints, n_lfs]. Each cell L[i, j] contains the label that LF j assigned to data point i, or -1 if it abstained. Choose the applier that matches your data backend: PandasLFApplier for in-memory pandas DataFrames, DaskLFApplier for Dask DataFrames, or SparkLFApplier for PySpark RDDs.
Key considerations:
- PandasLFApplier is single-process and suitable for datasets that fit in memory
- DaskLFApplier and SparkLFApplier enable distributed execution for large datasets
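- Pass fault_tolerant=True when applying so that an LF that raises an exception is recorded as an abstain (-1) instead of aborting the run
- The appliers return the label matrix as a dense NumPy array of integers, with abstains stored explicitly as -1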
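The applier's job can be sketched in a few lines of plain Python. The helper apply_lfs below is a hypothetical stand-in, not Snorkel's implementation: it loops over data points and LFs, builds the label matrix as a list of rows, and, when fault_tolerant is set, records an abstain for any LF that raises. The two keyword LFs and the sample texts are made up for illustration.

```python
ABSTAIN, NEG, POS = -1, 0, 1


def lf_good(text):
    return POS if "good" in text.lower() else ABSTAIN


def lf_bad(text):
    return NEG if "bad" in text.lower() else ABSTAIN


def apply_lfs(lfs, data_points, fault_tolerant=False):
    """Return the label matrix: one row per data point, one column per LF."""
    L = []
    for x in data_points:
        row = []
        for lf in lfs:
            try:
                row.append(lf(x))
            except Exception:
                if not fault_tolerant:
                    raise
                # A failing LF counts as "no opinion" in fault-tolerant mode.
                row.append(ABSTAIN)
        L.append(row)
    return L


lfs = [lf_good, lf_bad]
# The None entry makes both LFs raise AttributeError.
data = ["Good value", "bad battery, good screen", None, "arrived on time"]
L = apply_lfs(lfs, data, fault_tolerant=True)
```

Without fault_tolerant=True the same call stops at the third data point, which is usually what you want while debugging a new LF.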
Step 3: Analyze Labeling Function Quality
Use LFAnalysis to compute summary statistics about LF performance before training the label model. This step produces a DataFrame with per-LF metrics: coverage (fraction of data points the LF labels), overlap (fraction of data points where the LF and at least one other LF both vote), conflict (fraction of data points where the LF and at least one other LF vote differently), and optionally empirical accuracy against a gold development set.
Key considerations:
- Low-coverage LFs may need broader heuristics or may be candidates for removal
- High conflict rates between LFs indicate disagreement that the label model must resolve
- If a gold development set is available, compute empirical accuracy to identify weak LFs
- Use the lf_summary method for a comprehensive per-LF statistics table
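To make these statistics concrete, the following standard-library sketch computes them from a small hand-written label matrix. The function name lf_stats, the toy matrix, and the gold labels are assumptions for illustration. Each rate is taken as a fraction of all data points, and empirical accuracy is computed only over the points an LF actually labels.

```python
ABSTAIN = -1


def lf_stats(L, Y=None):
    """Per-LF coverage, overlap, and conflict; empirical accuracy if gold labels Y are given."""
    n = len(L)
    stats = []
    for j in range(len(L[0])):
        covered = overlap = conflict = correct = 0
        for i, row in enumerate(L):
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Non-abstain votes cast by every other LF on the same data point.
            others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
            overlap += bool(others)
            conflict += any(v != row[j] for v in others)
            if Y is not None:
                correct += row[j] == Y[i]
        entry = {
            "coverage": covered / n,
            "overlap": overlap / n,
            "conflict": conflict / n,
        }
        if Y is not None:
            # Undefined (None) for an LF that never votes.
            entry["emp_acc"] = correct / covered if covered else None
        stats.append(entry)
    return stats


L = [
    [1, 1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [-1, -1, -1],
]
Y_dev = [1, 0, 0, 1]  # hypothetical gold labels for the same four points
stats = lf_stats(L, Y_dev)
```

In this toy matrix the third LF never votes, so its coverage is zero and its empirical accuracy is undefined, which is the kind of LF the first consideration above flags for revision.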
Step 4: Train the Label Model
Instantiate and train the LabelModel on the label matrix. The model learns the conditional probabilities P(LF | Y) for each labeling function by analyzing the statistical dependencies in the label matrix using a matrix completion approach over a clique tree. No gold labels are needed for this step. Training optimizes these parameters with SGD by default; Adam and Adamax can be selected through the optimizer parameter.
Key considerations:
- Set cardinality to match the number of classes in the classification task
- The default training uses 100 epochs with SGD; adjust n_epochs and lr as needed
- The l2 parameter controls regularization strength
- Use prec_init to set prior beliefs about LF precision (default 0.7)
- The model assumes conditional independence of LFs given Y by default
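Why agreement patterns alone are enough to recover LF accuracies can be illustrated with a much simpler estimator than the one the LabelModel uses. The sketch below is not Snorkel's matrix completion procedure; it is a closed-form method-of-moments ("triplet") estimator that assumes exactly three conditionally independent binary LFs that never abstain and are each better than chance. Note that votes here are encoded as -1/+1 for the two classes, unlike the convention elsewhere in this workflow where -1 means abstain. The synthetic data and all function names are assumptions made for the example.

```python
from itertools import product
from math import sqrt


def synthetic_votes(acc_tenths):
    """Votes in {-1, +1} for LFs whose accuracies are given in tenths (9 means 0.9).

    Every combination of true label and correct/wrong pattern appears with a count
    proportional to its probability, so the LFs are exactly conditionally independent.
    """
    rows = []
    for y in (-1, 1):
        for pattern in product((True, False), repeat=len(acc_tenths)):
            count = 1
            for is_correct, tenths in zip(pattern, acc_tenths):
                count *= tenths if is_correct else 10 - tenths
            rows.extend([[y if c else -y for c in pattern]] * count)
    return rows


def agreement(L, i, j):
    """Mean of vote_i * vote_j: +1 if the two LFs always agree, -1 if they never do."""
    return sum(row[i] * row[j] for row in L) / len(L)


def triplet_accuracies(L):
    """Estimate the accuracy of each of three LFs from pairwise agreement rates only."""
    estimates = []
    for i in range(3):
        j, k = [idx for idx in range(3) if idx != i]
        # Under conditional independence, agreement(i, j) = a_i * a_j
        # with a_i = 2 * acc_i - 1, so a_i can be isolated from three pairs.
        a_i = sqrt(agreement(L, i, j) * agreement(L, i, k) / agreement(L, j, k))
        estimates.append((1 + a_i) / 2)
    return estimates


# The true labels are discarded; the estimator sees only the votes.
L = synthetic_votes([9, 8, 7])
estimated = triplet_accuracies(L)
```

On this idealized data the estimates match the accuracies used to generate the votes (0.9, 0.8, 0.7) up to floating-point error. Real label matrices contain abstains, sampling noise, and correlated LFs, which is where an iteratively trained model with regularization and a precision prior becomes useful.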
Step 5: Generate Probabilistic Labels
Use the trained LabelModel to produce probabilistic labels for the dataset. The predict_proba method returns a matrix of class probabilities P(Y | LFs) for each data point. The predict method returns hard labels by taking the argmax of the probabilities; its tie_break_policy argument controls what happens on ties and accepts "abstain" (the default, which returns -1), "random", or "true-random".
Key considerations:
- predict_proba returns soft labels suitable for training with cross-entropy loss
- predict returns hard integer labels using argmax
- Filter out data points where no LFs voted using the filter_unlabeled_dataframe utility
- The probabilistic labels can be used directly with Snorkel's cross_entropy_with_probs loss function for end model training
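The relationship between soft labels, hard labels, and filtering can be sketched without the library. The code below is a simplified stand-in, not the LabelModel's inference code: it assumes the LF accuracies are already known (the values are hypothetical), a uniform class prior, conditionally independent LFs whose errors are spread evenly over the wrong classes, and abstains that carry no information. The function names combine_votes and to_hard_labels are made up for this example.

```python
ABSTAIN = -1


def combine_votes(L, accuracies, cardinality=2):
    """Posterior P(Y | votes) per row under a naive Bayes model with a uniform prior."""
    probs = []
    for row in L:
        weights = [1.0] * cardinality
        for vote, acc in zip(row, accuracies):
            if vote == ABSTAIN:
                continue  # abstains are treated as uninformative
            for y in range(cardinality):
                weights[y] *= acc if vote == y else (1 - acc) / (cardinality - 1)
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs


def to_hard_labels(probs, tol=1e-9):
    """Argmax per row; return ABSTAIN when the top probability is tied."""
    preds = []
    for p in probs:
        best = max(p)
        winners = [y for y, v in enumerate(p) if abs(v - best) < tol]
        preds.append(winners[0] if len(winners) == 1 else ABSTAIN)
    return preds


accuracies = [0.9, 0.6]            # hypothetical learned LF accuracies
L = [[1, 0], [0, -1], [-1, -1]]    # the last point received no votes
probs = combine_votes(L, accuracies)
preds = to_hard_labels(probs)

# Keep only points that at least one LF labeled, as a filtering utility would.
kept = [i for i, row in enumerate(L) if any(v != ABSTAIN for v in row)]
probs_train = [probs[i] for i in kept]
```

The unlabeled point ends up with a uniform distribution and an abstain as its hard label, which is why such points are normally dropped before training the end model.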
Step 6: Evaluate Label Quality
Assess the quality of the generated labels by comparing against a gold validation set using the Scorer class. Compute metrics such as accuracy, F1 score, and ROC-AUC to determine if the label model output is suitable for downstream training.
Key considerations:
- Use probs_to_preds to convert probabilistic labels to hard predictions for metric computation
- The Scorer supports multiple metrics computed simultaneously
- Compare label model performance against baseline approaches (MajorityLabelVoter, RandomVoter)
- Iterate on LF design based on evaluation results
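A minimal, library-free sketch of this evaluation loop follows. The gold labels, label matrix, and probabilities are invented for illustration, and the helper functions are hypothetical stand-ins for the Scorer, probs_to_preds, and MajorityLabelVoter utilities rather than their actual implementations. In this sketch an abstaining prediction simply counts as an error.

```python
ABSTAIN = -1


def to_hard_labels(probs):
    """Argmax over class probabilities (the first class wins exact ties in this sketch)."""
    return [max(range(len(p)), key=lambda y: p[y]) for p in probs]


def majority_vote(L, cardinality=2):
    """Baseline: most common non-abstain vote per row, ABSTAIN on ties or no votes."""
    preds = []
    for row in L:
        counts = [sum(v == y for v in row) for y in range(cardinality)]
        best = max(counts)
        unique_winner = best > 0 and counts.count(best) == 1
        preds.append(counts.index(best) if unique_winner else ABSTAIN)
    return preds


def accuracy(golds, preds):
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)


def f1(golds, preds, positive=1):
    tp = sum(g == positive and p == positive for g, p in zip(golds, preds))
    fp = sum(g != positive and p == positive for g, p in zip(golds, preds))
    fn = sum(g == positive and p != positive for g, p in zip(golds, preds))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


golds = [1, 0, 1, 1, 0]
L = [[1, 1, 0], [0, -1, 0], [1, 0, -1], [0, 0, 1], [-1, -1, -1]]
# Hypothetical label model output for the same five points.
probs = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.7, 0.3]]

model_preds = to_hard_labels(probs)
baseline_preds = majority_vote(L)
report = {
    "label_model": {"accuracy": accuracy(golds, model_preds), "f1": f1(golds, model_preds)},
    "majority_vote": {"accuracy": accuracy(golds, baseline_preds), "f1": f1(golds, baseline_preds)},
}
```

Reporting the baseline next to the label model makes it easy to see whether learned LF weights add anything over simple vote counting; if they do not, the LF set itself is the more likely place to improve.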