Principle: Cleanlab Latent Noise Estimation
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Method for estimating latent noise transition matrices from a confident joint, characterizing the systematic patterns of label corruption in a dataset.
Description
Given the confident joint matrix, latent noise estimation derives three key quantities that fully characterize the noise structure of a labeled dataset:
- py -- the latent prior distribution of true labels, representing the actual class proportions before label corruption occurred.
- noise_matrix -- the matrix P(given_label | true_label) describing how true labels get corrupted into noisy observed labels. Entry noise_matrix[i][j] is the probability that an example with true label j receives the noisy given label i.
- inv_noise_matrix -- the inverse noise matrix P(true_label | given_label) describing what the true label likely is given an observed noisy label. Entry inv_noise_matrix[j][i] is the probability that the true label is j given that the observed label is i. Despite the name, this is the Bayes-reversed conditional distribution, not the linear-algebraic matrix inverse of noise_matrix.
Together, these three quantities provide a complete statistical model of the label corruption process. The noise matrix reveals which classes are systematically confused by annotators, the inverse noise matrix enables correcting for noise at prediction time, and the true prior reveals whether the observed class distribution is distorted by noise.
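As a concrete sanity check of these definitions, a small NumPy sketch with an invented 2-class confident joint shows that the three quantities are mutually consistent: marginalizing the noise matrix over the true prior recovers the observed label distribution.

```python
import numpy as np

# Toy calibrated confident joint for K = 2 classes (numbers invented for
# illustration): rows index the given (noisy) label, columns the true label.
C = np.array([[40.0, 5.0],
              [10.0, 45.0]])

py = C.sum(axis=0) / C.sum()                      # P(true_label = j)
noise_matrix = C / C.sum(axis=0, keepdims=True)   # P(given | true); columns sum to 1
ps = C.sum(axis=1) / C.sum()                      # observed P(given_label = i)

# Consistency check: summing P(given = i | true = j) * P(true = j) over j
# must reproduce the observed label distribution ps.
assert np.allclose(noise_matrix @ py, ps)
print(py)            # [0.5 0.5]
print(noise_matrix)  # [[0.8 0.1]
                     #  [0.2 0.9]]
```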
Usage
Use when you need to understand the systematic structure of label noise in your dataset. This is valuable for diagnosing annotation quality, understanding which classes are most commonly confused, correcting class priors for training, and building noise-aware classifiers. The estimated noise matrices are also used internally by some filtering methods to determine per-class error counts.
Theoretical Basis
Given the calibrated confident joint C of shape (K, K), the three latent quantities are derived as follows:
True label prior (py):
py[j] = sum(C[:, j]) / sum(C)
This estimates the fraction of examples whose true label is class j, computed as the column sum of C normalized by the total count.
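Assuming the calibrated confident joint is held in a NumPy array `C` (a hypothetical variable name for this sketch), the prior is a one-line computation:

```python
import numpy as np

# Toy confident joint (invented numbers): rows = given label, cols = true label.
C = np.array([[40.0, 5.0],
              [10.0, 45.0]])

# py[j] = sum(C[:, j]) / sum(C): column sums normalized by the total count.
py = C.sum(axis=0) / C.sum()

assert np.isclose(py.sum(), 1.0)  # py is a valid probability distribution
```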
Noise matrix P(given_label | true_label):
noise_matrix[i][j] = C[i][j] / sum(C[:, j])
Each column of C is normalized to sum to 1, giving the conditional distribution of given labels for each true label class.
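The column normalization can be sketched in NumPy (toy `C` as an illustrative stand-in for the calibrated confident joint):

```python
import numpy as np

# Toy confident joint (invented numbers): rows = given label, cols = true label.
C = np.array([[40.0, 5.0],
              [10.0, 45.0]])

# noise_matrix[i][j] = C[i][j] / sum(C[:, j]): normalize each column of C.
noise_matrix = C / C.sum(axis=0, keepdims=True)

# Each column j is the conditional distribution P(given | true = j), so it sums to 1.
assert np.allclose(noise_matrix.sum(axis=0), 1.0)
```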
Inverse noise matrix P(true_label | given_label):
inv_noise_matrix[j][i] = C[i][j] / sum(C[i, :])
Each row of C is normalized to sum to 1, giving the conditional distribution of true labels for each given label class.
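The row normalization looks similar, with a transpose so that rows of the result index the true label (again using a toy `C` for illustration):

```python
import numpy as np

# Toy confident joint (invented numbers): rows = given label, cols = true label.
C = np.array([[40.0, 5.0],
              [10.0, 45.0]])

# inv_noise_matrix[j][i] = C[i][j] / sum(C[i, :]): normalize each row of C,
# then transpose so rows index the true label and columns the given label.
inv_noise_matrix = (C / C.sum(axis=1, keepdims=True)).T

# Each column i is the conditional distribution P(true | given = i), so it sums to 1.
assert np.allclose(inv_noise_matrix.sum(axis=0), 1.0)
```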
An optional iterative convergence procedure can refine these estimates by alternating between computing the noise matrices and re-estimating the confident joint until the estimates stabilize. Four methods for estimating py are supported: direct counting ("cnt"), equation-based ("eqn"), and two marginal methods ("marginal", "marginal_ps").
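One way to see why multiple py methods can coexist: on a calibrated confident joint, direct counting ("cnt") and a marginalization through the inverse noise matrix give the same answer. The sketch below illustrates only that agreement with invented numbers; it does not reproduce cleanlab's exact "eqn" or "marginal_ps" formulas.

```python
import numpy as np

# Toy calibrated confident joint (invented numbers).
C = np.array([[40.0, 5.0],
              [10.0, 45.0]])

ps = C.sum(axis=1) / C.sum()                         # observed label distribution
inv_noise_matrix = (C / C.sum(axis=1, keepdims=True)).T

py_cnt = C.sum(axis=0) / C.sum()                     # "cnt": direct column counting
py_marginal = inv_noise_matrix @ ps                  # sum_i P(true = j | given = i) * P(given = i)

# On a calibrated confident joint the two estimates coincide.
assert np.allclose(py_cnt, py_marginal)
```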