Principle:Cleanlab Cleanlab Coteaching Algorithm

Knowledge Sources	Cleanlab
Domains	Deep Learning, Noisy Labels, Robust Training
Last Updated	2026-02-09 00:00 GMT

Overview

Co-Teaching is a training paradigm where two neural networks simultaneously learn from noisy data by cross-selecting clean samples for each other, exploiting the observation that different networks tend to memorize different noisy examples.

Description

The Co-Teaching algorithm, introduced by Han et al. (2018), addresses the challenge of training deep neural networks when a significant fraction of training labels are incorrect. Deep networks have sufficient capacity to memorize noisy labels, which degrades generalization. Co-Teaching combats this by training two networks in tandem: in each mini-batch, each network computes the loss on all examples, selects the examples with the smallest loss values (those most likely correctly labeled), and passes only those selected examples to the other network for parameter updates.

The key insight is that two independently initialized networks will both fit clean data first (due to the simplicity bias of gradient descent) but will go on to memorize different noisy examples. Because each network is updated only on the examples its peer selects, a noisy example that one network has memorized can still be rejected by the other, which gives a stronger denoising effect than self-selection alone.

Usage

Co-Teaching is the right choice when:

You want to train a model directly on noisy data rather than first cleaning the dataset.
The noise rate in labels is known or can be estimated (used to set the forget rate).
You have sufficient GPU resources to train two models simultaneously.
You need an end-to-end training approach that does not require a separate label-cleaning preprocessing step.

It is an alternative to cleanlab's primary approach of identifying and removing label issues before training.

Theoretical Basis

Core Mechanism: Small-Loss Selection

The foundation of Co-Teaching rests on the memorization effect of deep neural networks: networks tend to learn simple, generalizable patterns (clean data) before memorizing complex, noisy patterns. In early training, examples with small loss are more likely to be correctly labeled.

For each mini-batch, the co-teaching loss operates as follows:

1. Compute per-example cross-entropy loss for both models independently.
2. Sort examples by loss in ascending order for each model.
3. Select the top R(t) fraction of examples with the smallest loss, where R(t) = 1 - forget_rate(t) is the remember rate at epoch t.
4. Cross-update: Model 1 trains on the examples selected by Model 2, and vice versa.

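Mathematically, for models with parameters $\theta_1$ and $\theta_2$, a mini-batch $B$, and learning rate $\mathrm{lr}$:

$$D_1 = \operatorname*{arg\,min}_{D \subseteq B,\ |D| = R(t)\,|B|} \sum_{x \in D} L(x; \theta_2) \quad \text{(selected by model 2)}$$

$$D_2 = \operatorname*{arg\,min}_{D \subseteq B,\ |D| = R(t)\,|B|} \sum_{x \in D} L(x; \theta_1) \quad \text{(selected by model 1)}$$

$$\theta_1 \leftarrow \theta_1 - \mathrm{lr} \cdot \nabla_{\theta_1} \sum_{x \in D_1} L(x; \theta_1)$$

$$\theta_2 \leftarrow \theta_2 - \mathrm{lr} \cdot \nabla_{\theta_2} \sum_{x \in D_2} L(x; \theta_2)$$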
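The selection steps above can be sketched in plain Python. This is a minimal illustration rather than cleanlab's implementation: the function names (`cross_entropy`, `small_loss_indices`, `coteaching_step`) and the toy probabilities are hypothetical, the subset size is assumed to be rounded down to an integer, and the gradient update on the selected subsets is left to whatever training framework is in use.

```python
import math


def cross_entropy(probs, label):
    """Per-example cross-entropy: negative log-probability of the given label."""
    return -math.log(probs[label])


def small_loss_indices(losses, remember_rate):
    """Indices of the int(remember_rate * batch_size) smallest losses."""
    num_keep = int(remember_rate * len(losses))  # assumption: round down
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return order[:num_keep]


def coteaching_step(probs_1, probs_2, labels, forget_rate):
    """Return (D_1, D_2): the example indices each model should train on.

    D_1 is chosen from model 2's losses and D_2 from model 1's losses.
    """
    remember_rate = 1.0 - forget_rate
    losses_1 = [cross_entropy(p, y) for p, y in zip(probs_1, labels)]
    losses_2 = [cross_entropy(p, y) for p, y in zip(probs_2, labels)]
    d_1 = small_loss_indices(losses_2, remember_rate)  # selected by model 2
    d_2 = small_loss_indices(losses_1, remember_rate)  # selected by model 1
    return d_1, d_2


# Toy batch of four examples with two classes (made-up predicted probabilities).
labels = [0, 1, 0, 1]
probs_1 = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4]]
probs_2 = [[0.8, 0.2], [0.4, 0.6], [0.7, 0.3], [0.1, 0.9]]

d_1, d_2 = coteaching_step(probs_1, probs_2, labels, forget_rate=0.5)
print(d_1)  # [3, 0]: model 1 would be updated on these examples
print(d_2)  # [0, 1]: model 2 would be updated on these examples
```

Note that the two subsets differ: each model is updated on the examples its peer finds easiest, not on its own low-loss examples.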

Forget Rate Scheduling

The forget rate is not fixed: it ramps up linearly during a warmup period and is then held constant, following a curriculum:

$$\text{forget\_rate}(t) = \min\left(\frac{t}{T_k},\, 1\right) \cdot \tau^{c}$$

where $T_k$ is the number of gradual warmup epochs, $\tau$ is the target forget rate, and $c$ is an exponent applied to the target rate (with $c = 1$ the forget rate plateaus at $\tau$). This gradual increase allows the models to learn from all data in early epochs (when memorization has not yet occurred) and to filter progressively more aggressively until the warmup ends at epoch $T_k$.
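As a quick check of the schedule formula above, the following sketch evaluates it for a short run. The function and parameter names are hypothetical: `warmup_epochs` stands for T_k, `target_rate` for tau, and `exponent` for c. The values (4 warmup epochs, target rate 0.2) are made up for illustration.

```python
def forget_rate(epoch, warmup_epochs, target_rate, exponent=1.0):
    """Evaluate forget_rate(t) = min(t / T_k, 1) * tau^c for one epoch."""
    ramp = min(epoch / warmup_epochs, 1.0)  # linear ramp, capped at 1
    return ramp * target_rate ** exponent


# Six epochs with a four-epoch warmup towards a 20% forget rate.
schedule = [forget_rate(t, warmup_epochs=4, target_rate=0.2) for t in range(6)]
remember = [1.0 - f for f in schedule]  # R(t) = 1 - forget_rate(t)

for t, (f, r) in enumerate(zip(schedule, remember)):
    print(f"epoch {t}: forget {f:.2f}, remember {r:.2f}")
```

Under these assumptions the forget rate starts at 0 (every example is kept), rises by 0.05 per epoch, and stays at 0.20 from epoch 4 onward.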

Why Cross-Selection Works

If a single network were to both select and train on its own small-loss examples (self-selection), it would develop a confirmation bias: once an example is memorized, it would consistently appear as low-loss and never be filtered. Cross-selection breaks this feedback loop because the two networks, initialized differently, develop different memorization patterns. An example memorized by one network may still show high loss in the other, allowing it to be filtered out.
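A toy example may make the difference concrete. The losses below are invented for illustration, and `keep_smallest` is a hypothetical helper: example 4 is assumed to be mislabeled, network 1 is assumed to have already memorized it (low loss), and network 2 has not (high loss).

```python
def keep_smallest(losses, num_keep):
    """Indices of the num_keep smallest losses, ordered by ascending loss."""
    return sorted(range(len(losses)), key=lambda i: losses[i])[:num_keep]


# Hypothetical per-example losses for a batch of five examples.
losses_net1 = [0.10, 0.20, 0.15, 0.30, 0.05]  # network 1 memorized example 4
losses_net2 = [0.12, 0.18, 0.25, 0.22, 2.40]  # network 2 still finds it hard

self_selected = keep_smallest(losses_net1, 4)   # network 1 picks for itself
cross_selected = keep_smallest(losses_net2, 4)  # network 2 picks for network 1

print(4 in self_selected)   # True: the memorized noisy example is kept
print(4 in cross_selected)  # False: network 2 filters it out
```

With self-selection, network 1 would keep training on the example it has memorized. With cross-selection, network 2's high loss removes that example from network 1's update.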

Related Pages

Implementation:Cleanlab_Cleanlab_Coteaching_Train
