Principle:Cleanlab Cleanlab Coteaching Algorithm
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Noisy Labels, Robust Training |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Co-Teaching is a training paradigm where two neural networks simultaneously learn from noisy data by cross-selecting clean samples for each other, exploiting the observation that different networks tend to memorize different noisy examples.
Description
The Co-Teaching algorithm, introduced by Han et al. (2018), addresses the challenge of training deep neural networks when a significant fraction of training labels are incorrect. Deep networks have sufficient capacity to memorize noisy labels, which degrades generalization. Co-Teaching combats this by training two networks in tandem: in each mini-batch, each network computes the loss on all examples, selects the examples with the smallest loss values (those most likely correctly labeled), and passes only those selected examples to the other network for parameter updates.
The key insight is that two independently initialized networks will learn to fit clean data first (due to the simplicity bias of gradient descent) and will memorize different noisy examples. Because each network selects the training examples for the other, neither network is updated on examples that its peer finds suspicious, which provides a stronger denoising effect than self-selection alone.
Usage
Co-Teaching is the right choice when:
- You want to train a model directly on noisy data rather than first cleaning the dataset.
- The noise rate in labels is known or can be estimated (used to set the forget rate).
- You have sufficient GPU resources to train two models simultaneously.
- You need an end-to-end training approach that does not require a separate label-cleaning preprocessing step.
It is an alternative to cleanlab's primary approach of identifying and removing label issues before training.
Theoretical Basis
Core Mechanism: Small-Loss Selection
The foundation of Co-Teaching rests on the memorization effect of deep neural networks: networks tend to learn simple, generalizable patterns (clean data) before memorizing complex, noisy patterns. In early training, examples with small loss are more likely to be correctly labeled.
For each mini-batch, the co-teaching loss operates as follows:
- Compute per-example cross-entropy loss for both models independently.
- Sort examples by loss in ascending order for each model.
- Select the R(t) fraction of examples with the smallest loss, where R(t) = 1 - forget_rate(t) is the remember rate at epoch t.
- Cross-update: Model 1 trains on the examples selected by Model 2, and vice versa.
Mathematically, for models with parameters theta_1 and theta_2 and a mini-batch B:
D_1 = argmin_{|D| = R(t)*|B|} sum_{x in D} L(x; theta_2) (selected by model 2)
D_2 = argmin_{|D| = R(t)*|B|} sum_{x in D} L(x; theta_1) (selected by model 1)
theta_1 <- theta_1 - lr * gradient(sum L(x; theta_1) for x in D_1)
theta_2 <- theta_2 - lr * gradient(sum L(x; theta_2) for x in D_2)
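The selection step above can be sketched in plain Python. This is a minimal, framework-free illustration and not cleanlab's implementation: the function names (per_example_cross_entropy, small_loss_indices, co_teaching_selection) and the toy probabilities are hypothetical, and a real training loop would compute the losses on tensors and then take a gradient step on the returned subsets.

```python
import math


def per_example_cross_entropy(probs, labels):
    """Cross-entropy of each example, given predicted class probabilities."""
    return [-math.log(p[y]) for p, y in zip(probs, labels)]


def small_loss_indices(losses, remember_rate):
    """Indices of the remember_rate fraction of examples with the smallest loss."""
    num_keep = int(remember_rate * len(losses))
    order = sorted(range(len(losses)), key=lambda i: losses[i])  # ascending loss
    return sorted(order[:num_keep])


def co_teaching_selection(probs_1, probs_2, labels, forget_rate):
    """Return (d_1, d_2): the batch indices that model 1 and model 2 train on."""
    remember_rate = 1.0 - forget_rate
    losses_1 = per_example_cross_entropy(probs_1, labels)
    losses_2 = per_example_cross_entropy(probs_2, labels)
    d_1 = small_loss_indices(losses_2, remember_rate)  # chosen by model 2, used by model 1
    d_2 = small_loss_indices(losses_1, remember_rate)  # chosen by model 1, used by model 2
    return d_1, d_2


# Toy batch of four examples with binary labels (made-up numbers).
labels = [0, 1, 0, 1]
probs_1 = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
probs_2 = [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3], [0.1, 0.9]]

d_1, d_2 = co_teaching_selection(probs_1, probs_2, labels, forget_rate=0.5)
print("model 1 trains on", d_1)
print("model 2 trains on", d_2)
```

In this toy batch the two models disagree about which half of the batch looks clean, so each one is updated on a subset it would not have picked for itself.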
Forget Rate Scheduling
The forget rate is not fixed but gradually increases over training following a curriculum:
forget_rate(t) = min(t / T_k, 1) * tau^c
where T_k is the number of gradual warmup epochs, tau is the target forget rate, and c is an exponent controlling the schedule shape. With c = 1, the forget rate rises linearly from 0 to tau over the first T_k epochs and stays at tau afterwards. This gradual increase allows the models to learn from all data in early epochs (when memorization has not yet occurred) and to filter progressively more aggressively as training continues.
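As a quick check of the schedule, the formula can be evaluated directly. The sketch below is a literal transcription of the expression in this section, with hypothetical names (forget_rate, target_forget_rate, num_gradual, exponent) and made-up values; a library implementation may differ in detail.

```python
def forget_rate(epoch, target_forget_rate, num_gradual, exponent=1.0):
    """Evaluate forget_rate(t) = min(t / T_k, 1) * tau^c for one epoch t."""
    warmup = min(epoch / num_gradual, 1.0)  # ramps from 0 to 1 over T_k epochs
    return warmup * target_forget_rate ** exponent


# Assumed settings: tau = 0.2, T_k = 10, c = 1, sampled at epochs 0, 5, 10, 15.
epochs = [0, 5, 10, 15]
schedule = [forget_rate(t, target_forget_rate=0.2, num_gradual=10) for t in epochs]
remember = [1.0 - f for f in schedule]  # R(t) = 1 - forget_rate(t)

for t, f, r in zip(epochs, schedule, remember):
    print(f"epoch {t:2d}: forget_rate={f:.2f}, remember_rate={r:.2f}")
```

Under these assumed settings every example is kept at epoch 0, and from epoch 10 onward each model trains on the 80% of each mini-batch that its peer ranks as lowest-loss.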
Why Cross-Selection Works
If a single network were to both select and train on its own small-loss examples (self-selection), it would develop a confirmation bias: once an example is memorized, it would consistently appear as low-loss and never be filtered. Cross-selection breaks this feedback loop because the two networks, initialized differently, develop different memorization patterns. An example memorized by one network may still show high loss in the other, allowing it to be filtered out.
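A small numeric illustration of this argument follows. The per-example losses are invented for the purpose: example 4 stands for a mislabeled example that model 1 has already memorized (low loss) while model 2 has not (high loss). The helper keep_smallest is hypothetical.

```python
def keep_smallest(losses, num_keep):
    """Set of indices of the num_keep examples with the smallest loss."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return set(order[:num_keep])


# Invented losses for a batch of five examples; index 4 is the mislabeled one.
losses_1 = [0.10, 0.20, 0.15, 0.30, 0.05]  # model 1 has memorized example 4
losses_2 = [0.12, 0.18, 0.20, 0.25, 2.50]  # model 2 still finds it suspicious
noisy = 4

self_selected = keep_smallest(losses_1, 4)   # model 1 picks its own training set
cross_selected = keep_smallest(losses_2, 4)  # model 2 picks model 1's training set

print("self-selection keeps the noisy example:", noisy in self_selected)
print("cross-selection keeps the noisy example:", noisy in cross_selected)
```

With self-selection, model 1 would keep reinforcing the memorized label; with cross-selection, model 2's high loss removes the example from model 1's update.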