Implementation:Cleanlab Cleanlab Generate Noise Matrix

From Leeroopedia


Knowledge Sources
Domains: Data Quality, Benchmarking, Machine Learning
Last Updated: 2026-02-09 00:00 GMT

Overview

Provides utilities for generating synthetic noise matrices and noisy labels to benchmark cleanlab's label error detection algorithms.

Description

The noise_generation module in cleanlab's benchmarking package provides functions for injecting controlled label noise into classification datasets. The primary function, generate_noise_matrix_from_trace, generates a K x K matrix of conditional probabilities P(label=k_s|true_label=k_y) with a specified trace (the sum of its diagonal entries). The trace controls the overall noise level: a trace of K means no label noise, and smaller traces mean noisier labels. Supporting functions include:

  • noise_matrix_is_valid: checks whether a noise matrix satisfies the learnability conditions.
  • generate_noisy_labels: flips clean labels according to a noise matrix.
  • generate_n_rand_probabilities_that_sum_to_m: samples n random probabilities under a sum constraint.
  • randomly_distribute_N_balls_into_K_bins: distributes integer counts across bins with min/max constraints.
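To make the matrix convention concrete, here is a minimal, dependency-free sketch using a hand-written 3-class matrix. The values are illustrative and were not produced by cleanlab. Entry [s][y] holds P(label=s|true_label=y), so each column sums to 1 and the trace adds up the per-class probabilities of keeping the correct label.

```python
# Illustrative 3x3 noise matrix (made-up values for this sketch).
# Rows index the noisy label s, columns index the true label y.
noise_matrix = [
    [0.8, 0.1, 0.2],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.6],
]
K = len(noise_matrix)

# Each column is a probability distribution over noisy labels
# for examples of one true class, so it should sum to 1.
column_sums = [sum(noise_matrix[s][y] for s in range(K)) for y in range(K)]

# The trace sums P(label=k | true_label=k) over all classes.
trace = sum(noise_matrix[k][k] for k in range(K))

print([round(c, 6) for c in column_sums])
print(round(trace, 6))
```

A trace close to K leaves most labels untouched, while a trace near 1 moves most of the probability mass off the diagonal.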

Usage

Import this module when you need to create synthetic noisy datasets for evaluating label issue detection methods, when benchmarking cleanlab's algorithms under varying noise conditions, or when generating controlled label noise for research experiments on learning with noisy labels.

Code Reference

Source Location

  • Repository: Cleanlab
  • File: cleanlab/benchmarking/noise_generation.py
  • Lines: 1-487

Signature

def generate_noise_matrix_from_trace(
    K,
    trace,
    *,
    max_trace_prob=1.0,
    min_trace_prob=1e-5,
    max_noise_rate=1 - 1e-5,
    min_noise_rate=0.0,
    valid_noise_matrix=True,
    py=None,
    frac_zero_noise_rates=0.0,
    seed=0,
    max_iter=10000,
) -> Optional[np.ndarray]
def generate_noisy_labels(true_labels, noise_matrix) -> np.ndarray
def noise_matrix_is_valid(noise_matrix, py, *, verbose=False) -> bool

Import

from cleanlab.benchmarking.noise_generation import (
    generate_noise_matrix_from_trace,
    generate_noisy_labels,
    noise_matrix_is_valid,
)

I/O Contract

Inputs (generate_noise_matrix_from_trace)

Name Type Required Description
K int Yes Number of classes. Creates a noise matrix of shape (K, K). Must be >= 2.
trace float Yes Desired sum of diagonal entries. Must be > 1 when valid_noise_matrix is True.
max_trace_prob float No Maximum probability of any diagonal entry. Default 1.0.
min_trace_prob float No Minimum probability of any diagonal entry. Default 1e-5.
max_noise_rate float No Maximum off-diagonal noise rate. Default 1 - 1e-5.
min_noise_rate float No Minimum off-diagonal noise rate. Default 0.0.
valid_noise_matrix bool No If True, ensures the matrix satisfies the learnability condition. Default True.
py np.ndarray No Array of shape (K,) with prior probabilities P(true_label=k). Required when valid_noise_matrix is True and K > 2.
frac_zero_noise_rates float No Fraction of off-diagonal entries to set to zero. Default 0.0.
seed int No Random seed for reproducibility. Default 0.
max_iter int No Maximum number of iterations to produce a valid matrix. Default 10000.
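The learnability condition behind valid_noise_matrix can be illustrated without cleanlab. The sketch below assumes the condition is that, for every class k, the joint probability P(label=k, true_label=k) exceeds the product P(label=k) * P(true_label=k). The helper name is_learnable is hypothetical, and this is a simplified stand-in rather than the library's implementation of noise_matrix_is_valid.

```python
def is_learnable(noise_matrix, py):
    """Hypothetical helper: check P(s=k, y=k) > P(s=k) * P(y=k) for every class k.

    noise_matrix[s][y] is P(label=s | true_label=y); py[y] is P(true_label=y).
    """
    K = len(py)
    for k in range(K):
        # Joint probability that an example is truly k and also labeled k.
        joint_kk = noise_matrix[k][k] * py[k]
        # Marginal probability of observing noisy label k.
        ps_k = sum(noise_matrix[k][y] * py[y] for y in range(K))
        if joint_kk <= ps_k * py[k]:
            return False
    return True


py = [0.4, 0.35, 0.25]

# Mostly-correct labels (trace = 2.1).
mild_noise = [
    [0.8, 0.1, 0.2],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.6],
]

# Labels that are wrong more often than right (trace = 0.8).
heavy_noise = [
    [0.2, 0.5, 0.4],
    [0.5, 0.3, 0.3],
    [0.3, 0.2, 0.3],
]

print(is_learnable(mild_noise, py))
print(is_learnable(heavy_noise, py))
```

Under this assumed condition, the first matrix passes and the second fails. That is consistent with the requirement that trace be greater than 1 when valid_noise_matrix is True.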

Inputs (generate_noisy_labels)

Name Type Required Description
true_labels np.ndarray Yes Array of shape (N,) with clean integer labels in 0, 1, ..., K-1.
noise_matrix np.ndarray Yes Array of shape (K, K) representing P(label=k_s|true_label=k_y). Columns must sum to 1.

Outputs

Name Type Description
noise_matrix np.ndarray or None For generate_noise_matrix_from_trace: a (K, K) noise matrix with the specified trace, or None if max_iter is exceeded.
labels np.ndarray For generate_noisy_labels: a (N,) array of noisy labels produced by flipping clean labels according to the noise matrix.
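The flipping step can be pictured as drawing each noisy label from the noise-matrix column of its true class. The sketch below uses only the standard library. flip_labels is a hypothetical helper that samples every label independently, which may differ from how generate_noisy_labels decides which labels to flip.

```python
import random


def flip_labels(true_labels, noise_matrix, seed=0):
    """Hypothetical helper: draw each noisy label from column y of noise_matrix.

    noise_matrix[s][y] is P(label=s | true_label=y), so column y gives the
    sampling weights over noisy labels for an example whose true label is y.
    """
    rng = random.Random(seed)  # local RNG keeps the sketch reproducible
    classes = list(range(len(noise_matrix)))
    noisy = []
    for y in true_labels:
        weights = [noise_matrix[s][y] for s in classes]
        noisy.append(rng.choices(classes, weights=weights, k=1)[0])
    return noisy


# Illustrative matrix (made-up values); columns sum to 1.
noise_matrix = [
    [0.8, 0.1, 0.2],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.6],
]
true_labels = [0, 1, 2] * 1000  # balanced toy labels
noisy_labels = flip_labels(true_labels, noise_matrix)

flipped = sum(t != n for t, n in zip(true_labels, noisy_labels)) / len(true_labels)
print(f"Fraction of flipped labels: {flipped:.3f}")
```

Because the output has the same length as the input and only uses classes 0 to K-1, it can stand in for a noisy label array in quick experiments.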

Usage Examples

Basic Usage: Generate a Noise Matrix and Noisy Labels

import numpy as np
from cleanlab.benchmarking.noise_generation import (
    generate_noise_matrix_from_trace,
    generate_noisy_labels,
    noise_matrix_is_valid,
)

# Define 3-class problem with known class priors
K = 3
py = np.array([0.4, 0.35, 0.25])

# Generate a noise matrix with trace=2.1 (moderate noise)
noise_matrix = generate_noise_matrix_from_trace(
    K=K,
    trace=2.1,
    py=py,
    valid_noise_matrix=True,
    seed=42,
)

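# generate_noise_matrix_from_trace returns None if max_iter is exceeded
if noise_matrix is None:
    raise RuntimeError("No valid noise matrix found within max_iter iterations")

# Verify the noise matrix is learnable
is_valid = noise_matrix_is_valid(noise_matrix, py)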
print(f"Noise matrix valid: {is_valid}")

# Create synthetic clean labels
true_labels = np.random.choice(K, size=10000, p=py)

# Generate noisy labels
noisy_labels = generate_noisy_labels(true_labels, noise_matrix)
print(f"Fraction of flipped labels: {np.mean(true_labels != noisy_labels):.3f}")
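As a rough sanity check on the printed fraction, the expected share of flipped labels follows from the diagonal of the noise matrix and the class priors. This assumes each label is flipped according to the column of the noise matrix for its true class. With uniform priors the expression depends on the trace alone.

```latex
\mathbb{E}[\text{fraction flipped}]
  = 1 - \sum_{k=1}^{K} P(\text{true\_label}=k)\, P(\text{label}=k \mid \text{true\_label}=k)

% Special case: uniform priors, P(true_label=k) = 1/K
\mathbb{E}[\text{fraction flipped}] = 1 - \frac{\text{trace}}{K},
\qquad \text{e.g. } 1 - \frac{2.1}{3} = 0.3 \text{ for } K = 3,\ \text{trace} = 2.1
```

The example above uses non-uniform priors, so its printed value depends on how the trace is split across the diagonal and need not equal 0.3 exactly.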
