Principle: Cleanlab Latent Noise Estimation
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Method for estimating latent noise transition matrices from a confident joint, characterizing the systematic patterns of label corruption in a dataset.
Description
Given the confident joint matrix, latent noise estimation derives three key quantities that fully characterize the noise structure of a labeled dataset:
- py -- the latent prior distribution of true labels, representing the actual class proportions before label corruption occurred.
- noise_matrix -- the matrix P(given_label | true_label) describing how true labels get corrupted into noisy observed labels. Entry noise_matrix[i][j] is the probability that an example with true label j receives the noisy given label i.
- inv_noise_matrix -- the inverse noise matrix P(true_label | given_label) describing what the true label likely is given an observed noisy label. Entry inv_noise_matrix[j][i] is the probability that the true label is j given that the observed label is i. Despite the name, this is the Bayes-reversed conditional distribution, not the linear-algebraic matrix inverse of noise_matrix.
Together, these three quantities provide a complete statistical model of the label corruption process. The noise matrix reveals which classes are systematically confused by annotators, the inverse noise matrix enables correcting for noise at prediction time, and the true prior reveals whether the observed class distribution is distorted by noise.
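As a concrete sanity check of these definitions, a small NumPy sketch with an invented 2-class confident joint shows that the three quantities are mutually consistent: marginalizing the noise matrix over the true prior recovers the observed label distribution.

```python
import numpy as np

# Toy calibrated confident joint for K = 2 classes (numbers invented for
# illustration): rows index the given (noisy) label, columns the true label.
C = np.array([[40.0, 5.0],
              [10.0, 45.0]])

py = C.sum(axis=0) / C.sum()                      # P(true_label = j)
noise_matrix = C / C.sum(axis=0, keepdims=True)   # P(given | true); columns sum to 1
ps = C.sum(axis=1) / C.sum()                      # observed P(given_label = i)

# Consistency check: summing P(given = i | true = j) * P(true = j) over j
# must reproduce the observed label distribution ps.
assert np.allclose(noise_matrix @ py, ps)
print(py)            # [0.5 0.5]
print(noise_matrix)  # [[0.8 0.1]
                     #  [0.2 0.9]]
```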
Usage
Use when you need to understand the systematic structure of label noise in your dataset. This is valuable for diagnosing annotation quality, understanding which classes are most commonly confused, correcting class priors for training, and building noise-aware classifiers. The estimated noise matrices are also used internally by some filtering methods to determine per-class error counts.
Theoretical Basis
Given the calibrated confident joint C of shape (K, K), the three latent quantities are derived as follows:
True label prior (py):
py[j] = sum(C[:, j]) / sum(C)
This estimates the fraction of examples whose true label is class j, computed as the column sum of C normalized by the total count.
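Assuming the calibrated confident joint is held in a NumPy array `C` (a hypothetical variable name for this sketch), the prior is a one-line computation:

```python
import numpy as np

# Toy confident joint (invented numbers): rows = given label, cols = true label.
C = np.array([[40.0, 5.0],
              [10.0, 45.0]])

# py[j] = sum(C[:, j]) / sum(C): column sums normalized by the total count.
py = C.sum(axis=0) / C.sum()

assert np.isclose(py.sum(), 1.0)  # py is a valid probability distribution
```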
Noise matrix P(given_label | true_label):
noise_matrix[i][j] = C[i][j] / sum(C[:, j])
Each column of C is normalized to sum to 1, giving the conditional distribution of given labels for each true label class.
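The column normalization can be sketched in NumPy (toy `C` as an illustrative stand-in for the calibrated confident joint):

```python
import numpy as np

# Toy confident joint (invented numbers): rows = given label, cols = true label.
C = np.array([[40.0, 5.0],
              [10.0, 45.0]])

# noise_matrix[i][j] = C[i][j] / sum(C[:, j]): normalize each column of C.
noise_matrix = C / C.sum(axis=0, keepdims=True)

# Each column j is the conditional distribution P(given | true = j), so it sums to 1.
assert np.allclose(noise_matrix.sum(axis=0), 1.0)
```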
Inverse noise matrix P(true_label | given_label):
inv_noise_matrix[j][i] = C[i][j] / sum(C[i, :])
Each row of C is normalized to sum to 1, giving the conditional distribution of true labels for each given label class.
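The row normalization looks similar, with a transpose so that rows of the result index the true label (again using a toy `C` for illustration):

```python
import numpy as np

# Toy confident joint (invented numbers): rows = given label, cols = true label.
C = np.array([[40.0, 5.0],
              [10.0, 45.0]])

# inv_noise_matrix[j][i] = C[i][j] / sum(C[i, :]): normalize each row of C,
# then transpose so rows index the true label and columns the given label.
inv_noise_matrix = (C / C.sum(axis=1, keepdims=True)).T

# Each column i is the conditional distribution P(true | given = i), so it sums to 1.
assert np.allclose(inv_noise_matrix.sum(axis=0), 1.0)
```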
An optional iterative convergence procedure can refine these estimates by alternating between computing the noise matrices and re-estimating the confident joint until the estimates stabilize. Four methods for estimating py are supported: direct counting ("cnt"), equation-based ("eqn"), and two marginal methods ("marginal", "marginal_ps").
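One way to see why multiple py methods can coexist: on a calibrated confident joint, direct counting ("cnt") and a marginalization through the inverse noise matrix give the same answer. The sketch below illustrates only that agreement with invented numbers; it does not reproduce cleanlab's exact "eqn" or "marginal_ps" formulas.

```python
import numpy as np

# Toy calibrated confident joint (invented numbers).
C = np.array([[40.0, 5.0],
              [10.0, 45.0]])

ps = C.sum(axis=1) / C.sum()                         # observed label distribution
inv_noise_matrix = (C / C.sum(axis=1, keepdims=True)).T

py_cnt = C.sum(axis=0) / C.sum()                     # "cnt": direct column counting
py_marginal = inv_noise_matrix @ ps                  # sum_i P(true = j | given = i) * P(given = i)

# On a calibrated confident joint the two estimates coincide.
assert np.allclose(py_cnt, py_marginal)
```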