
Heuristic:Fastai Fastbook Mixup Data Augmentation

From Leeroopedia



Knowledge Sources
Domains Data_Augmentation, Computer_Vision
Last Updated 2026-02-09 17:00 GMT

Overview

Use the Mixup callback (`cbs=MixUp()` in fastai) for training from scratch or when limited data is available; it requires more epochs but reduces overfitting.

Description

Mixup is a data augmentation technique that creates virtual training examples by taking weighted linear combinations of pairs of training inputs and their labels. For each training step, a random image pair is selected and combined: `new_image = t * image1 + (1-t) * image2` with `new_target = t * target1 + (1-t) * target2`. In the book's simplified pseudocode, `t` is a random weight between 0.5 and 1.0; the actual implementation samples the weight from a Beta distribution. Either way, this requires one-hot encoded targets, since the mixed labels become fractional.
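The linear combination above can be checked with a tiny worked example. This is an illustrative sketch, not fastai code: the 2x2 "images", the one-hot targets, and the fixed weight `t = 0.7` are all made-up values chosen to show how the labels become fractional.

```python
import numpy as np

# Hypothetical 2x2 grayscale "images" and one-hot targets for a 3-class problem
image1 = np.array([[0.0, 0.2], [0.4, 0.6]])
image2 = np.array([[1.0, 1.0], [1.0, 1.0]])
target1 = np.array([1.0, 0.0, 0.0])  # class 0, one-hot encoded
target2 = np.array([0.0, 1.0, 0.0])  # class 1, one-hot encoded

t = 0.7  # mixing weight (in practice sampled randomly each training step)
new_image = t * image1 + (1 - t) * image2
new_target = t * target1 + (1 - t) * target2

print(new_target)  # fractional labels: [0.7 0.3 0. ]
```

The mixed target `[0.7, 0.3, 0.0]` is no longer a hard 0/1 vector, which is why one-hot encoding (rather than integer class indices) is needed.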

Mixup provides a form of augmentation that is "tunable" in intensity (unlike binary augmentations like flipping) and works across different data modalities, including NLP via activation-level mixing.

Usage

Use Mixup when:

  • Training from scratch (no pretrained model): Mixup provides strong regularization
  • Limited training data: Prevents overfitting by creating unlimited virtual examples
  • Long training runs (80+ epochs): Mixup results improve with more training; for short runs, standard augmentation may be better
  • Overfitting despite other augmentation: Mixup addresses the problem that traditional augmentations are fixed transforms

Do not use for short training runs (< 80 epochs) where standard augmentation already works well.

The Insight (Rule of Thumb)

  • Action: Pass `cbs=MixUp()` when creating the `Learner` (the fastai callback class is `MixUp`).
  • Value: No hyperparameters needed for basic usage. The mixing weight `t` is automatically sampled from a Beta distribution.
  • Trade-off: Training is harder (model must predict two labels with weights), so it requires more epochs. The Imagenette leaderboard shows Mixup wins at 80+ epochs but loses at fewer epochs.

Reasoning

Mixup elegantly solves several problems simultaneously:

  1. Continuous augmentation: Unlike flipping or rotation, Mixup intensity is continuously variable.
  2. Label smoothing effect: Targets are no longer hard 0/1 values, preventing the model from becoming over-confident.
  3. Unlimited variety: Every pair combination creates a unique training example, making it nearly impossible to memorize the training set.
  4. Modality-agnostic: Can be applied to images, text (via activation mixing), tabular data, etc.

The mathematical formulation is simple: `(x_new, y_new) = (λ*x_i + (1-λ)*x_j, λ*y_i + (1-λ)*y_j)` where λ is sampled from Beta(α, α).
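This formulation can be sketched as a batch-level operation, the way most implementations apply it: mix each batch with a shuffled copy of itself using a single λ drawn from Beta(α, α). The function name `mixup_batch`, the choice α = 0.4, and the fixed seed are illustrative assumptions, not fastai's API.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (illustrative)

def mixup_batch(x, y, alpha=0.4):
    """Mix a batch with a shuffled copy of itself; lam ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    idx = rng.permutation(len(x))          # random pairing within the batch
    x_new = lam * x + (1 - lam) * x[idx]   # mix inputs
    y_new = lam * y + (1 - lam) * y[idx]   # mix one-hot targets identically
    return x_new, y_new

# Toy batch: 4 samples with 2 features, one-hot targets over 2 classes
x = np.arange(8, dtype=float).reshape(4, 2)
y = np.eye(2)[[0, 1, 0, 1]]
x_mix, y_mix = mixup_batch(x, y)
```

Note that each mixed target row still sums to 1, so the result remains a valid probability distribution over classes.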

Code Evidence

Mixup usage from `07_sizing_and_tta.md:283-286`:

from fastai.vision.all import *  # provides Learner, xresnet50, MixUp, CrossEntropyLossFlat

model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=MixUp())
learn.fit_one_cycle(5, 3e-3)

Mixup algorithm pseudocode from `07_sizing_and_tta.md:247-251`:

image2,target2 = dataset[randint(0,len(dataset))]
t = random_float(0.5,1.0)
new_image = t * image1 + (1-t) * image2
new_target = t * target1 + (1-t) * target2
