Heuristic: Fastai Fastbook Mixup Data Augmentation
| Knowledge Sources | |
|---|---|
| Domains | Data_Augmentation, Computer_Vision |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Use the Mixup callback (`cbs=Mixup`) when training from scratch or when training data is limited; it requires more epochs but reduces overfitting.
Description
Mixup is a data augmentation technique that creates virtual training examples by taking weighted linear combinations of pairs of training inputs and their labels. For each training step, a random image pair is selected and combined: `new_image = t * image1 + (1-t) * image2` with `new_target = t * target1 + (1-t) * target2`, where `t` is a random weight between 0.5 and 1.0. This requires one-hot encoded targets since labels become fractional.
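As a minimal sketch of this blending step (toy NumPy arrays standing in for images; not fastai's actual implementation):

```python
import numpy as np

# Toy stand-ins: two 4x4 "images" and one-hot targets for a 3-class problem.
image1, target1 = np.full((4, 4), 1.0), np.array([1.0, 0.0, 0.0])
image2, target2 = np.full((4, 4), 0.0), np.array([0.0, 1.0, 0.0])

t = 0.7  # a fixed weight for illustration; in practice t is sampled at random
new_image = t * image1 + (1 - t) * image2
new_target = t * target1 + (1 - t) * target2

print(new_target)  # fractional labels: [0.7 0.3 0. ]
```

The fractional `new_target` is why one-hot encoding is required: a single integer class index cannot represent "70% class A, 30% class B".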
Mixup provides a form of augmentation that is "tunable" in intensity (unlike binary augmentations like flipping) and works across different data modalities, including NLP via activation-level mixing.
Usage
Use Mixup when:
- Training from scratch (no pretrained model): Mixup provides strong regularization
- Limited training data: Prevents overfitting by creating unlimited virtual examples
- Long training runs (80+ epochs): Mixup results improve with more training; for short runs, standard augmentation may be better
- Overfitting despite other augmentation: Mixup addresses the problem that traditional augmentations are fixed transforms
Do not use for short training runs (< 80 epochs) where standard augmentation already works well.
The Insight (Rule of Thumb)
- Action: Add `cbs=Mixup` when creating the `Learner`.
- Value: No hyperparameters needed for basic usage. The mixing weight `t` is automatically sampled from a Beta distribution.
- Trade-off: Training is harder (model must predict two labels with weights), so it requires more epochs. The Imagenette leaderboard shows Mixup wins at 80+ epochs but loses at fewer epochs.
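To see what the Beta-sampled weight looks like in practice, here is a hedged sketch (`alpha=0.4` is an illustrative value chosen here, not taken from the text above):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.4  # assumed value for illustration
lam = rng.beta(alpha, alpha, size=100_000)

# Weights stay in [0, 1] and are symmetric around 0.5, but with a small
# alpha most draws land near 0 or 1, so one image usually dominates the mix.
print(round(lam.mean(), 2))  # ~0.5
print(((lam < 0.2) | (lam > 0.8)).mean() > ((0.2 < lam) & (lam < 0.8)).mean())
```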
Reasoning
Mixup elegantly solves several problems simultaneously:
- Continuous augmentation: Unlike flipping or rotation, Mixup intensity is continuously variable.
- Label smoothing effect: Targets are no longer hard 0/1 values, preventing the model from becoming over-confident.
- Unlimited variety: Every pair combination creates a unique training example, making it nearly impossible to memorize the training set.
- Modality-agnostic: Can be applied to images, text (via activation mixing), tabular data, etc.
The mathematical formulation is simple: `(x_new, y_new) = (λ*x_i + (1-λ)*x_j, λ*y_i + (1-λ)*y_j)` where λ is sampled from Beta(α, α).
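The formula translates directly into a small function; this is a sketch assuming NumPy arrays and one-hot labels (the names here are illustrative, not fastai's API):

```python
import numpy as np

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.4, rng=None):
    """Mix one example pair: (lam*x_i + (1-lam)*x_j, lam*y_i + (1-lam)*y_j)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # lam ~ Beta(alpha, alpha)
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j

x_new, y_new = mixup_pair(np.ones(4), np.array([1.0, 0.0]),
                          np.zeros(4), np.array([0.0, 1.0]))
# The fractional labels still sum to 1, i.e. they remain a valid distribution.
print(np.isclose(y_new.sum(), 1.0))  # True
```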
Code Evidence
Mixup usage from `07_sizing_and_tta.md:283-286`:
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=Mixup)
learn.fit_one_cycle(5, 3e-3)
Mixup algorithm pseudocode from `07_sizing_and_tta.md:247-251`:
image2,target2 = dataset[randint(0,len(dataset))]
t = random_float(0.5,1.0)
new_image = t * image1 + (1-t) * image2
new_target = t * target1 + (1-t) * target2
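A runnable translation of that pseudocode, using toy lists and the stdlib `random` module (the dataset here is a hypothetical stand-in; note Python's `random.randint` is inclusive, so the upper bound is `len(dataset) - 1`):

```python
import random

# Toy dataset: (image, one-hot target) pairs, with images as flat lists.
dataset = [([1.0, 1.0], [1.0, 0.0]), ([0.0, 0.0], [0.0, 1.0])]
image1, target1 = dataset[0]

image2, target2 = dataset[random.randint(0, len(dataset) - 1)]
t = random.uniform(0.5, 1.0)
new_image = [t * a + (1 - t) * b for a, b in zip(image1, image2)]
new_target = [t * a + (1 - t) * b for a, b in zip(target1, target2)]

print(new_target)  # mixed label, e.g. [0.83, 0.17] when the pair differs
```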