
Heuristic:Snorkel LabelModel Mu Eps Clamping

From Leeroopedia
Domains: Weak_Supervision, Optimization
Last Updated: 2026-02-14 21:00 GMT

Overview

Parameter clamping heuristic for the LabelModel that bounds learned conditional probabilities using a data-size-dependent epsilon, preventing degenerate solutions in sparse labeling settings.

Description

During LabelModel training, the learned μ parameters (conditional probabilities of LF outputs given the true label) are clamped to the range `[mu_eps, 1 - mu_eps]` after each gradient step. The default mu_eps is computed dynamically as `min(0.01, 1 / 10^ceil(log10(n)))` where `n` is the number of data points. This prevents the optimizer from pushing probabilities to exact 0 or 1, which would cause numerical instability and degenerate solutions.
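The dynamic default can be reproduced in a few lines of plain Python (a standalone sketch of the formula, not Snorkel's own code):

```python
import math

def default_mu_eps(n: int) -> float:
    """Reproduce the dynamic default: min(0.01, 1 / 10^ceil(log10(n)))."""
    return min(0.01, 1 / 10 ** math.ceil(math.log10(n)))

# The epsilon shrinks by one order of magnitude per order of magnitude of n,
# capped at 0.01 for small datasets.
for n in (100, 1_000, 10_000, 250_000):
    print(n, default_mu_eps(n))
```

Note that `n` not on a power-of-10 boundary rounds up: for 250,000 points the epsilon is 1e-06, the same as for 1,000,000.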

Usage

Apply this heuristic when training the LabelModel, especially in sparse labeling settings where most LFs abstain on most data points. If you observe that all learned conditional probabilities converge to the same value (all equal to mu_eps or 1 - mu_eps), your mu_eps is likely too high. See GitHub issue #1422 for details.

The Insight (Rule of Thumb)

  • Action: Set `mu_eps` in LabelModel `fit()` kwargs when the default produces degenerate results.
  • Value: Default is `min(0.01, 1 / 10^ceil(log10(n)))`. For 100 data points: 0.01. For 1,000: 0.001. For 10,000: 0.0001. The 0.01 cap is a hard-coded upper bound on the epsilon.
  • Trade-off: Setting mu_eps too high forces all learned probabilities toward the boundary values. Setting it too low risks numerical instability.
  • Diagnostic: If `get_weights()` returns nearly identical values for all LFs after training, mu_eps is likely too aggressive.
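The collapse named in the diagnostic can be illustrated with a standalone NumPy sketch (the array values below are hypothetical, not taken from a real training run):

```python
import numpy as np

# Hypothetical learned conditional probabilities for four LFs.
mu = np.array([0.02, 0.15, 0.85, 0.97])

# Reasonable epsilon: the parameters keep their spread.
print(np.clip(mu, 0.001, 1 - 0.001))   # values unchanged

# Overly aggressive epsilon: everything is squeezed onto the boundaries,
# mimicking the "all weights nearly identical" symptom.
print(np.clip(mu, 0.4, 1 - 0.4))       # -> [0.4, 0.4, 0.6, 0.6]
```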

Reasoning

The mu parameters represent P(LF=y|Y=y) -- the conditional probability that a labeling function outputs label y given the true label is y. Without clamping, gradient descent can push these to exact 0 or 1, causing:

  • Log-probability computations to produce `-inf` or `NaN`
  • The Munkres alignment algorithm to fail
  • Degenerate solutions where all LFs appear equally accurate

The data-size-dependent formula `1 / 10^ceil(log10(n))` makes the epsilon inversely proportional to the order of magnitude of the dataset size. The rounding to powers of 10 is intentional -- it makes it "more obvious when the parameters have been clamped" (from code comment). The 0.01 cap prevents the epsilon from being too large on small datasets.
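The first failure mode above is easy to reproduce in isolation; this sketch uses plain NumPy rather than the LabelModel's torch parameters:

```python
import numpy as np

mu_eps = 1e-4
p = np.array([0.0, 1.0, 0.5])  # probabilities pushed to the exact boundary

# Without clamping, log-probabilities blow up at 0.
with np.errstate(divide="ignore"):
    print(np.log(p))           # -inf for the 0.0 entry

# After clamping into [mu_eps, 1 - mu_eps], every log is finite.
clamped = np.clip(p, mu_eps, 1 - mu_eps)
print(np.log(clamped))
```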

Code evidence from `label_model.py:742-761`:

    def _clamp_params(self) -> None:
        """Clamp the values of the learned parameter vector."""
        if self.train_config.mu_eps is not None:
            mu_eps = self.train_config.mu_eps
        else:
            mu_eps = min(0.01, 1 / 10 ** np.ceil(np.log10(self.n)))
        self.mu.data = self.mu.clamp(mu_eps, 1 - mu_eps)
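For experimentation outside a training loop, the same logic can be mirrored in NumPy (`clamp_params` here is a hypothetical standalone helper, not part of Snorkel's API):

```python
import numpy as np

def clamp_params(mu, n, mu_eps=None):
    """NumPy analogue of LabelModel._clamp_params: an explicit mu_eps wins,
    otherwise fall back to the data-size-dependent default."""
    if mu_eps is None:
        mu_eps = min(0.01, 1 / 10 ** np.ceil(np.log10(n)))
    return np.clip(mu, mu_eps, 1 - mu_eps)

mu = np.array([1e-8, 0.3, 0.999999])
print(clamp_params(mu, n=1_000))   # clamped into [0.001, 0.999]
```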
