Implementation:Cleanlab Cleanlab Generate Noise Matrix
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Benchmarking, Machine Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Provides utilities for generating synthetic noise matrices and noisy labels to benchmark cleanlab's label error detection algorithms.
Description
The noise_generation module in cleanlab's benchmarking package supplies functions for injecting controlled label noise into classification datasets. The primary function, generate_noise_matrix_from_trace, generates a K x K conditional probability matrix P(label=k_s|true_label=k_y) with a specified trace (the sum of its diagonal entries). Because each diagonal entry is the probability that a class keeps its true label, a larger trace means less noise overall. Supporting functions:
- noise_matrix_is_valid: checks whether a noise matrix satisfies learnability conditions.
- generate_noisy_labels: flips clean labels according to a noise matrix.
- generate_n_rand_probabilities_that_sum_to_m: constrained Dirichlet sampling.
- randomly_distribute_N_balls_into_K_bins: distributes integer counts across bins with min/max constraints.
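To make the role of the trace concrete, the following minimal sketch builds a 3-class noise matrix by hand and inspects it. The matrix values and the helper names column_sums and trace are illustrative assumptions, not output or functions from cleanlab; plain Python lists are used so the snippet runs without any dependencies.

```python
# Hand-written, illustrative 3-class noise matrix (not produced by cleanlab).
# Entry [i][j] stands for P(label=i | true_label=j), so each column is a
# probability distribution over the observed labels.
noise_matrix = [
    [0.8, 0.1, 0.2],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.6],
]
K = len(noise_matrix)


def column_sums(m):
    """Hypothetical helper: sum each column of a square matrix."""
    n = len(m)
    return [sum(m[i][j] for i in range(n)) for j in range(n)]


def trace(m):
    """Hypothetical helper: sum of the diagonal entries."""
    return sum(m[k][k] for k in range(len(m)))


col_sums = column_sums(noise_matrix)  # every column should sum to 1
tr = trace(noise_matrix)              # total "keep" probability across classes
avg_keep = tr / K                     # unweighted mean of P(label=k | true_label=k)

print(col_sums, tr, avg_keep)
```

Here the trace is about 2.1 out of a maximum of 3, so on average (unweighted by class prior) a class keeps its true label roughly 70% of the time.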
Usage
Import this module when you need to create synthetic noisy datasets for evaluating label issue detection methods, when benchmarking cleanlab's algorithms under varying noise conditions, or when generating controlled label noise for research experiments on learning with noisy labels.
Code Reference
Source Location
- Repository: Cleanlab
- File: cleanlab/benchmarking/noise_generation.py
- Lines: 1-487
Signature
def generate_noise_matrix_from_trace(
    K,
    trace,
    *,
    max_trace_prob=1.0,
    min_trace_prob=1e-5,
    max_noise_rate=1 - 1e-5,
    min_noise_rate=0.0,
    valid_noise_matrix=True,
    py=None,
    frac_zero_noise_rates=0.0,
    seed=0,
    max_iter=10000,
) -> Optional[np.ndarray]

def generate_noisy_labels(true_labels, noise_matrix) -> np.ndarray

def noise_matrix_is_valid(noise_matrix, py, *, verbose=False) -> bool
Import
from cleanlab.benchmarking.noise_generation import (
    generate_noise_matrix_from_trace,
    generate_noisy_labels,
    noise_matrix_is_valid,
)
I/O Contract
Inputs (generate_noise_matrix_from_trace)
| Name | Type | Required | Description |
|---|---|---|---|
| K | int | Yes | Number of classes. Creates a noise matrix of shape (K, K). Must be >= 2. |
| trace | float | Yes | Desired sum of diagonal entries. Must be > 1 when valid_noise_matrix is True. |
| max_trace_prob | float | No | Maximum probability of any diagonal entry. Default 1.0. |
| min_trace_prob | float | No | Minimum probability of any diagonal entry. Default 1e-5. |
| max_noise_rate | float | No | Maximum off-diagonal noise rate. Default 1 - 1e-5. |
| min_noise_rate | float | No | Minimum off-diagonal noise rate. Default 0.0. |
| valid_noise_matrix | bool | No | If True, ensures the matrix satisfies the learnability condition. Default True. |
| py | np.ndarray | No | Array of shape (K,) with prior probabilities P(true_label=k). Required when valid_noise_matrix is True and K > 2. |
| frac_zero_noise_rates | float | No | Fraction of off-diagonal entries to set to zero. Default 0.0. |
| seed | int | No | Random seed for reproducibility. Default 0. |
| max_iter | int | No | Maximum number of iterations to produce a valid matrix. Default 10000. |
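The valid_noise_matrix flag refers to a learnability condition that this page does not spell out. The sketch below assumes one common form of such a condition: for every class k, the joint probability P(label=k, true_label=k) must exceed the product P(label=k) * P(true_label=k). Both that condition and the helper name is_learnable are assumptions made for illustration; this is not a copy of noise_matrix_is_valid.

```python
def is_learnable(noise_matrix, py):
    """Hypothetical check (assumed condition, not cleanlab's code):
    require P(label=k, true_label=k) > P(label=k) * P(true_label=k)
    for every class k."""
    K = len(py)
    # joint[i][j] = P(label=i | true_label=j) * P(true_label=j)
    joint = [[noise_matrix[i][j] * py[j] for j in range(K)] for i in range(K)]
    for k in range(K):
        p_label_k = sum(joint[k])  # marginal P(label=k)
        if joint[k][k] <= p_label_k * py[k]:
            return False
    return True


py = [0.4, 0.35, 0.25]

# Diagonal-dominant matrix: observed labels mostly agree with true labels.
mostly_clean = [
    [0.8, 0.1, 0.2],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.6],
]

# Matrix in which class 0 is rarely labeled correctly.
mostly_flipped = [
    [0.2, 0.3, 0.6],
    [0.3, 0.4, 0.3],
    [0.5, 0.3, 0.1],
]

clean_ok = is_learnable(mostly_clean, py)
flipped_ok = is_learnable(mostly_flipped, py)
print(clean_ok, flipped_ok)
```

Under the assumed condition, the first matrix passes because observing label k makes true class k more likely than its prior, while the second fails for class 0.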
Inputs (generate_noisy_labels)
| Name | Type | Required | Description |
|---|---|---|---|
| true_labels | np.ndarray | Yes | Array of shape (N,) with clean integer labels in 0, 1, ..., K-1. |
| noise_matrix | np.ndarray | Yes | Array of shape (K, K) with conditional probabilities P(label=k_s given true_label=k_y). Columns must sum to 1. |
Outputs
| Name | Type | Description |
|---|---|---|
| noise_matrix | np.ndarray or None | For generate_noise_matrix_from_trace: a (K, K) noise matrix with the specified trace, or None if max_iter is exceeded. |
| labels | np.ndarray | For generate_noisy_labels: an array of shape (N,) with noisy labels produced by flipping clean labels according to the noise matrix. |
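The flipping step behind generate_noisy_labels can be pictured with a simplified stand-in: for each example, draw the noisy label from the column of the noise matrix indexed by its true label. The function flip_labels below is hypothetical and samples each example independently; cleanlab's own function may distribute flips differently (for example, by fixing flip counts per class), so treat this only as a sketch of the idea.

```python
import random


def flip_labels(true_labels, noise_matrix, seed=0):
    """Hypothetical helper: sample each noisy label from the column of
    noise_matrix that corresponds to the example's true label."""
    rng = random.Random(seed)
    classes = list(range(len(noise_matrix)))
    noisy = []
    for y in true_labels:
        # Column y holds P(label=i | true_label=y) for every class i.
        column = [noise_matrix[i][y] for i in classes]
        noisy.append(rng.choices(classes, weights=column, k=1)[0])
    return noisy


# Illustrative column-stochastic matrix (values are made up).
noise_matrix = [
    [0.8, 0.1, 0.2],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.6],
]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

true_labels = [0] * 400 + [1] * 350 + [2] * 250
noisy_labels = flip_labels(true_labels, noise_matrix, seed=0)
unchanged = flip_labels(true_labels, identity, seed=0)

flip_rate = sum(t != n for t, n in zip(true_labels, noisy_labels)) / len(true_labels)
print(f"Fraction of flipped labels: {flip_rate:.3f}")
```

With these class counts the expected flip fraction is 0.4 * 0.2 + 0.35 * 0.3 + 0.25 * 0.4 = 0.285; the sampled value will vary around that figure. An identity noise matrix leaves every label unchanged.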
Usage Examples
Basic Usage: Generate a Noise Matrix and Noisy Labels
import numpy as np

from cleanlab.benchmarking.noise_generation import (
    generate_noise_matrix_from_trace,
    generate_noisy_labels,
    noise_matrix_is_valid,
)

# Define a 3-class problem with known class priors
K = 3
py = np.array([0.4, 0.35, 0.25])

# Generate a noise matrix with trace=2.1 (average diagonal entry of 0.7)
noise_matrix = generate_noise_matrix_from_trace(
    K=K,
    trace=2.1,
    py=py,
    valid_noise_matrix=True,
    seed=42,
)

# The generator returns None if no valid matrix is found within max_iter
if noise_matrix is None:
    raise RuntimeError("No valid noise matrix found within max_iter.")

# Verify the noise matrix is learnable
is_valid = noise_matrix_is_valid(noise_matrix, py)
print(f"Noise matrix valid: {is_valid}")

# Create synthetic clean labels drawn from the class priors
true_labels = np.random.choice(K, size=10000, p=py)

# Generate noisy labels
noisy_labels = generate_noisy_labels(true_labels, noise_matrix)
print(f"Fraction of flipped labels: {np.mean(true_labels != noisy_labels):.3f}")
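A natural sanity check after generating noisy labels is to estimate the noise matrix back from the (true, noisy) label pairs and compare it with the one that was requested. The helper empirical_noise_matrix below is a hypothetical, dependency-free sketch of that estimate and is not part of cleanlab; it works on any two equal-length sequences of integer labels.

```python
def empirical_noise_matrix(true_labels, noisy_labels, K):
    """Hypothetical helper: estimate P(label=i | true_label=j) by counting.

    Returns a K x K list of lists whose column j is the observed
    distribution of noisy labels among examples with true label j.
    Columns for classes that never occur are left as zeros.
    """
    counts = [[0] * K for _ in range(K)]
    for t, n in zip(true_labels, noisy_labels):
        counts[n][t] += 1
    col_totals = [sum(counts[i][j] for i in range(K)) for j in range(K)]
    return [
        [counts[i][j] / col_totals[j] if col_totals[j] else 0.0 for j in range(K)]
        for i in range(K)
    ]


# Tiny hand-made example: one of the two class-0 examples was flipped to 1.
true_demo = [0, 0, 1, 1]
noisy_demo = [0, 1, 1, 1]
estimate = empirical_noise_matrix(true_demo, noisy_demo, K=2)
print(estimate)
```

Applied to true_labels and noisy_labels from the example above (for instance via their tolist() output), the estimated columns should be close to the columns of noise_matrix when the dataset is large.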