Heuristic: Fastai Fastbook Mixup Data Augmentation
| Knowledge Sources | |
|---|---|
| Domains | Data_Augmentation, Computer_Vision |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Use the Mixup callback (`cbs=Mixup`) when training from scratch or when training data is limited; it requires more epochs but reduces overfitting.
Description
Mixup is a data augmentation technique that creates virtual training examples by taking weighted linear combinations of pairs of training inputs and their labels. For each training step, a random image pair is selected and combined: `new_image = t * image1 + (1-t) * image2` with `new_target = t * target1 + (1-t) * target2`, where `t` is a random weight between 0.5 and 1.0. This requires one-hot encoded targets since labels become fractional.
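As a minimal sketch of this blending step (toy NumPy arrays standing in for images; not fastai's actual implementation):

```python
import numpy as np

# Toy stand-ins: two 4x4 "images" and one-hot targets for a 3-class problem.
image1, target1 = np.full((4, 4), 1.0), np.array([1.0, 0.0, 0.0])
image2, target2 = np.full((4, 4), 0.0), np.array([0.0, 1.0, 0.0])

t = 0.7  # a fixed weight for illustration; in practice t is sampled at random
new_image = t * image1 + (1 - t) * image2
new_target = t * target1 + (1 - t) * target2

print(new_target)  # fractional labels: [0.7 0.3 0. ]
```

The fractional `new_target` is why one-hot encoding is required: a single integer class index cannot represent "70% class A, 30% class B".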
Mixup provides a form of augmentation that is "tunable" in intensity (unlike binary augmentations like flipping) and works across different data modalities, including NLP via activation-level mixing.
Usage
Use Mixup when:
- Training from scratch (no pretrained model): Mixup provides strong regularization
- Limited training data: Prevents overfitting by creating unlimited virtual examples
- Long training runs (80+ epochs): Mixup results improve with more training; for short runs, standard augmentation may be better
- Overfitting despite other augmentation: Mixup addresses the problem that traditional augmentations are fixed transforms
Do not use for short training runs (< 80 epochs) where standard augmentation already works well.
The Insight (Rule of Thumb)
- Action: Add `cbs=Mixup` when creating the `Learner`.
- Value: No hyperparameters needed for basic usage. The mixing weight `t` is automatically sampled from a Beta distribution.
- Trade-off: Training is harder (model must predict two labels with weights), so it requires more epochs. The Imagenette leaderboard shows Mixup wins at 80+ epochs but loses at fewer epochs.
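To see what the Beta-sampled weight looks like in practice, here is a hedged sketch (`alpha=0.4` is an illustrative value chosen here, not taken from the text above):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.4  # assumed value for illustration
lam = rng.beta(alpha, alpha, size=100_000)

# Weights stay in [0, 1] and are symmetric around 0.5, but with a small
# alpha most draws land near 0 or 1, so one image usually dominates the mix.
print(round(lam.mean(), 2))  # ~0.5
print(((lam < 0.2) | (lam > 0.8)).mean() > ((0.2 < lam) & (lam < 0.8)).mean())
```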
Reasoning
Mixup elegantly solves several problems simultaneously:
- Continuous augmentation: Unlike flipping or rotation, Mixup intensity is continuously variable.
- Label smoothing effect: Targets are no longer hard 0/1 values, preventing the model from becoming over-confident.
- Unlimited variety: Every pair combination creates a unique training example, making it nearly impossible to memorize the training set.
- Modality-agnostic: Can be applied to images, text (via activation mixing), tabular data, etc.
The mathematical formulation is simple: `(x_new, y_new) = (λ*x_i + (1-λ)*x_j, λ*y_i + (1-λ)*y_j)` where λ is sampled from Beta(α, α).
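The formula translates directly into a small function; this is a sketch assuming NumPy arrays and one-hot labels (the names here are illustrative, not fastai's API):

```python
import numpy as np

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.4, rng=None):
    """Mix one example pair: (lam*x_i + (1-lam)*x_j, lam*y_i + (1-lam)*y_j)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # lam ~ Beta(alpha, alpha)
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j

x_new, y_new = mixup_pair(np.ones(4), np.array([1.0, 0.0]),
                          np.zeros(4), np.array([0.0, 1.0]))
# The fractional labels still sum to 1, i.e. they remain a valid distribution.
print(np.isclose(y_new.sum(), 1.0))  # True
```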
Code Evidence
Mixup usage from `07_sizing_and_tta.md:283-286`:
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=Mixup)
learn.fit_one_cycle(5, 3e-3)
Mixup algorithm pseudocode from `07_sizing_and_tta.md:247-251`:
image2,target2 = dataset[randint(0,len(dataset))]
t = random_float(0.5,1.0)
new_image = t * image1 + (1-t) * image2
new_target = t * target1 + (1-t) * target2
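A runnable translation of that pseudocode, using toy lists and the stdlib `random` module (the dataset here is a hypothetical stand-in; note Python's `random.randint` is inclusive, so the upper bound is `len(dataset) - 1`):

```python
import random

# Toy dataset: (image, one-hot target) pairs, with images as flat lists.
dataset = [([1.0, 1.0], [1.0, 0.0]), ([0.0, 0.0], [0.0, 1.0])]
image1, target1 = dataset[0]

image2, target2 = dataset[random.randint(0, len(dataset) - 1)]
t = random.uniform(0.5, 1.0)
new_image = [t * a + (1 - t) * b for a, b in zip(image1, image2)]
new_target = [t * a + (1 - t) * b for a, b in zip(target1, target2)]

print(new_target)  # mixed label, e.g. [0.83, 0.17] when the pair differs
```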