Principle:Cleanlab Cleanlab Synthetic Noise Generation

From Leeroopedia


Knowledge Sources
Domains: Data Quality, Benchmarking, Statistical Modeling
Last Updated: 2026-02-09 00:00 GMT

Overview

Synthetic noise generation is the process of creating controlled, artificial label noise in classification datasets to enable systematic benchmarking and evaluation of noise-robust learning algorithms.

Description

In real-world classification tasks, labels are often noisy due to annotator errors, ambiguous examples, or systematic biases. To evaluate algorithms that detect or correct such noise, researchers need datasets with known noise characteristics. Synthetic noise generation provides this by constructing a noise matrix (also called a label transition matrix) that specifies the conditional probability of observing each noisy label given each true label, and then applying this matrix to flip clean labels accordingly.

The noise matrix is a K x K column-stochastic matrix where entry (i, j) represents P(observed_label = i | true_label = j). The diagonal entries give the probability that a label of each class remains correct, and the off-diagonal entries give the probability of each specific mislabeling. The trace of this matrix (the sum of its diagonal entries) serves as a single scalar summary of the overall noise level: a trace of K means no noise (the identity matrix, perfect labels), while a trace close to 1 is the level reached by uniformly random labels, where every entry equals 1/K.

A critical property for any generated noise matrix is learnability: the condition that it must be possible to achieve better-than-random classification performance despite the noise. Without this property, the noise is so severe or structured that no algorithm can learn from the data.

Usage

Synthetic noise generation is the right approach when you need to evaluate label error detection algorithms on data with known ground truth, when benchmarking the robustness of machine learning models to varying levels of label noise, or when conducting controlled experiments on the effects of specific noise structures (e.g., confusion between particular class pairs).

Theoretical Basis

Noise Matrix Definition

The noise matrix N is defined as:

N[i][j] = P(observed_label = i | true_label = j)

Each column of N sums to 1 (column-stochastic property), and the matrix has shape K x K where K is the number of classes.
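
As a minimal sketch of this definition, the snippet below builds a hypothetical 3-class noise matrix as a plain list of lists and checks the column-stochastic property and the trace. The matrix values and the helper names is_column_stochastic and trace are illustrative assumptions, not part of any library.

```python
import math

# Hypothetical 3-class noise matrix (illustrative values only):
# N[i][j] = P(observed_label = i | true_label = j)
N = [
    [0.8, 0.1, 0.0],
    [0.1, 0.9, 0.3],
    [0.1, 0.0, 0.7],
]

def is_column_stochastic(matrix, tol=1e-9):
    """Return True if the matrix is square, entries lie in [0, 1], and each column sums to 1."""
    k = len(matrix)
    if any(len(row) != k for row in matrix):
        return False
    if any(not (0.0 <= p <= 1.0) for row in matrix for p in row):
        return False
    return all(
        math.isclose(sum(matrix[i][j] for i in range(k)), 1.0, abs_tol=tol)
        for j in range(k)
    )

def trace(matrix):
    """Sum of the diagonal entries, used as the scalar noise-level summary."""
    return sum(matrix[i][i] for i in range(len(matrix)))

print(is_column_stochastic(N))
print(round(trace(N), 6))
```

Here column j lists how examples whose true label is j are spread over the observed labels, so the check runs over columns rather than rows.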

Learnability Condition

A noise matrix N is learnable (valid for learning with noisy labels) if and only if for every class k:

P(true_label = k) * P(observed_label = k) < P(true_label = k, observed_label = k)

In other words, each diagonal entry of the joint distribution matrix must exceed the product of the corresponding marginals, which means that observing label k makes true label k more likely than its prior. When this condition holds, the true class labels can be recovered at better-than-random accuracy despite the noise.
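
The inequality can be checked directly from a noise matrix and a class prior. The following is a minimal sketch under the definitions above; the function name noise_matrix_is_learnable and the example matrices are hypothetical and chosen only for illustration.

```python
def noise_matrix_is_learnable(noise_matrix, py):
    """Check P(true=k) * P(observed=k) < P(true=k, observed=k) for every class k.

    noise_matrix[i][j] = P(observed_label = i | true_label = j)
    py[j]              = P(true_label = j)
    """
    k = len(py)
    # joint[i][j] = P(observed = i, true = j) = N[i][j] * P(true = j)
    joint = [[noise_matrix[i][j] * py[j] for j in range(k)] for i in range(k)]
    # ps[i] = P(observed = i), the row sums of the joint distribution
    ps = [sum(row) for row in joint]
    return all(py[c] * ps[c] < joint[c][c] for c in range(k))

py = [0.5, 0.5]
mild_noise = [[0.9, 0.2], [0.1, 0.8]]     # labels mostly correct
mostly_swapped = [[0.2, 0.8], [0.8, 0.2]]  # labels mostly exchanged

print(noise_matrix_is_learnable(mild_noise, py))
print(noise_matrix_is_learnable(mostly_swapped, py))
```

For mild_noise and class 0, the joint diagonal is 0.9 * 0.5 = 0.45 and the product of marginals is 0.5 * 0.55 = 0.275, so the condition holds. For mostly_swapped the joint diagonal is 0.1 against a product of 0.25, so the condition fails.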

Trace-Based Generation

Given a desired trace value T (where 1 < T <= K), the generation algorithm:

  1. Samples K diagonal probabilities that sum to T using constrained Dirichlet sampling with iterative clamping to enforce min/max bounds.
  2. Distributes off-diagonal noise rates for each column such that each column sums to 1, optionally enforcing a fraction of zero noise rates for sparse noise patterns.
  3. Validates the resulting matrix against the learnability condition using the class prior distribution py.
  4. Iterates up to max_iter times until a valid matrix is found.
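
The four steps above can be sketched with the standard library alone. This is a simplified illustration, not the library's implementation: a single clamp-and-redistribute pass stands in for the iterative clamping, only an upper bound of 1 is enforced on the diagonal, and the optional fraction of zero noise rates is omitted. All function names are hypothetical.

```python
import random

def is_learnable(noise_matrix, py):
    """Learnability check: P(true=k) * P(observed=k) < P(true=k, observed=k) for all k."""
    k = len(py)
    joint = [[noise_matrix[i][j] * py[j] for j in range(k)] for i in range(k)]
    ps = [sum(row) for row in joint]
    return all(py[c] * ps[c] < joint[c][c] for c in range(k))

def dirichlet_scaled(n, total, rng):
    """Draw n non-negative values summing to `total` (flat Dirichlet, rescaled)."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    s = sum(draws)
    return [total * d / s for d in draws]

def sample_diagonal(k, trace, rng, max_prob=1.0):
    """Step 1 (simplified): k diagonal entries summing to `trace`, each at most max_prob."""
    diag = dirichlet_scaled(k, trace, rng)
    excess = sum(v - max_prob for v in diag if v > max_prob)
    if excess > 0:
        # Move the mass above max_prob onto entries that still have headroom.
        headroom = [max(max_prob - v, 0.0) for v in diag]
        total_headroom = sum(headroom)
        diag = [min(max_prob, min(v, max_prob) + excess * h / total_headroom)
                for v, h in zip(diag, headroom)]
    return diag

def generate_noise_matrix_from_trace_sketch(k, trace, py=None, seed=0, max_iter=10000):
    if not (1 < trace <= k):
        raise ValueError("trace must satisfy 1 < trace <= k")
    rng = random.Random(seed)
    py = py if py is not None else [1.0 / k] * k  # assume a uniform prior if none given
    for _ in range(max_iter):  # step 4: retry until a valid matrix is found
        diag = sample_diagonal(k, trace, rng)
        matrix = [[0.0] * k for _ in range(k)]
        for j in range(k):
            matrix[j][j] = diag[j]
            # Step 2: spread the remaining mass 1 - diag[j] over the off-diagonal rows.
            off = dirichlet_scaled(k - 1, max(0.0, 1.0 - diag[j]), rng)
            rows = [i for i in range(k) if i != j]
            for i, p in zip(rows, off):
                matrix[i][j] = p
        if is_learnable(matrix, py):  # step 3: validate against the prior py
            return matrix
    return None  # no valid matrix found within max_iter attempts

matrix = generate_noise_matrix_from_trace_sketch(3, 2.2, seed=0)
for row in matrix:
    print([round(p, 3) for p in row])
```

Fixing the seed makes the sketch reproducible, which matters when the same noise matrix must be reused across benchmark runs.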

Label Flipping

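Given a valid noise matrix N and clean labels, noisy labels are produced by computing, from the joint distribution, the expected number of examples for each pair of observed label i and true label j. Here N_total denotes the total number of examples and is distinct from the noise matrix N: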

count_joint[i][j] = round(N[i][j] * P(true_label = j) * N_total)

For each true class k, randomly selected examples with true label k are reassigned to noisy labels according to these counts, ensuring the empirical noise matrix closely matches the target.
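
A minimal sketch of this count-based flipping is shown below. It assumes integer labels 0..K-1 and uses the fact that P(true_label = j) * N_total equals the number of examples with true label j. The function name flip_labels is hypothetical, and the simple rounding used here may differ from how a production implementation reconciles rounded counts.

```python
import random

def flip_labels(labels, noise_matrix, seed=0):
    """Return a noisy copy of `labels` whose flip counts follow noise_matrix.

    noise_matrix[i][j] = P(observed_label = i | true_label = j)
    """
    rng = random.Random(seed)
    k = len(noise_matrix)
    noisy = list(labels)
    for j in range(k):
        # Indices of examples whose true label is j, in random order.
        idx = [n for n, y in enumerate(labels) if y == j]
        rng.shuffle(idx)
        start = 0
        for i in range(k):
            if i == j:
                continue  # examples not selected below keep their true label
            # count_joint[i][j] = round(N[i][j] * P(true = j) * N_total)
            count = round(noise_matrix[i][j] * len(idx))
            for n in idx[start:start + count]:
                noisy[n] = i
            start += count
    return noisy

labels = [0] * 500 + [1] * 500
noise_matrix = [[0.9, 0.2], [0.1, 0.8]]
noisy_labels = flip_labels(labels, noise_matrix, seed=0)

flipped_0_to_1 = sum(1 for y in noisy_labels[:500] if y == 1)
flipped_1_to_0 = sum(1 for y in noisy_labels[500:] if y == 0)
print(flipped_0_to_1, flipped_1_to_0)
```

Because the flips are assigned by count rather than by independent coin tosses per example, the empirical noise matrix of the result matches the target up to rounding.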
