# Heuristic: Microsoft LoRA Label Smoothing NLG
| Knowledge Sources | |
|---|---|
| Domains | Optimization, NLG |
| Last Updated | 2026-02-10 05:30 GMT |
## Overview
Apply label smoothing (typically 0.1) during GPT-2 NLG fine-tuning to prevent overconfident predictions and improve text generation quality.
## Description
The GPT-2 LoRA fine-tuning script supports label smoothing via the `--label_smooth` argument. When enabled (value > 0.0001), instead of standard cross-entropy loss it computes a weighted combination of the negative log-likelihood loss and a smooth loss (the mean negative log-probability over the vocabulary, i.e. the cross-entropy against a uniform target distribution). The formula is `loss = (1 - label_smooth) * nll_loss + label_smooth * smooth_loss`. This prevents the model from becoming overconfident in its predictions, which matters for text generation tasks where output diversity is valued.
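The formula above can be sketched in a few lines of pure Python for a single position over a tiny vocabulary (a minimal illustration of the math only; the repository code operates on batched tensors, and the function name here is mine):

```python
import math

def label_smoothed_loss(logits, target, label_smooth=0.1):
    """Label-smoothed NLL for one prediction over a small vocabulary.

    Implements loss = (1 - eps) * nll_loss + eps * smooth_loss, where
    smooth_loss is the mean negative log-probability across the vocab.
    """
    # Stable log-softmax: subtract the max logit before exponentiating.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    logprobs = [x - log_z for x in logits]

    nll_loss = -logprobs[target]                      # standard cross-entropy term
    smooth_loss = -sum(logprobs) / len(logprobs)      # uniform-target term
    return (1.0 - label_smooth) * nll_loss + label_smooth * smooth_loss

# With label_smooth=0.0 this reduces to plain cross-entropy.
loss = label_smoothed_loss([2.0, 0.5, -1.0], target=0, label_smooth=0.1)
```

Note that when the model is confident in the correct token, the smoothed loss is strictly larger than the plain NLL, which is exactly the penalty on overconfidence.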
## Usage
Use label smoothing when fine-tuning GPT-2 with LoRA for NLG tasks (E2E, DART, WebNLG). Set `--label_smooth 0.1` as the typical value. This is a training-time optimization; it does not affect inference.
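A command along these lines enables it (a sketch only: the data paths, LoRA hyperparameters, and flags other than `--label_smooth` are illustrative values in the style of the repository's E2E example, not a verified invocation):

```shell
python src/gpt2_ft.py \
    --train_data ./data/e2e/train.jsonl \
    --lora_dim 4 \
    --lora_alpha 32 \
    --label_smooth 0.1
```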
## The Insight (Rule of Thumb)
- Action: Set `--label_smooth 0.1` in the GPT-2 fine-tuning command.
- Value: 0.1 is the typical value. The threshold for activation is > 0.0001.
- Trade-off: Training loss converges to a slightly higher floor in exchange for better generalization and more diverse text generation; label smoothing acts as a regularizer.
## Reasoning
In language generation, the model is trained to predict the next token. Without label smoothing, the model is encouraged to assign probability 1.0 to the correct token and 0.0 to all others. This leads to overconfident, repetitive outputs. Label smoothing distributes a small fraction of the probability mass across all vocabulary tokens, encouraging the model to be less certain and more diverse. The smoothed loss is computed explicitly using log_softmax to avoid numerical issues.
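The numerical-stability point can be demonstrated directly: exponentiating large logits before taking the log overflows, while the standard log-softmax trick of subtracting the max logit first stays finite. A self-contained sketch (function names are mine; `torch.nn.functional.log_softmax` does the stable computation internally):

```python
import math

def naive_log_softmax(logits):
    # exp() of a large logit overflows before the log can tame it
    z = sum(math.exp(x) for x in logits)
    return [math.log(math.exp(x) / z) for x in logits]

def stable_log_softmax(logits):
    # subtract the max logit so every exponent is <= 0
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

# naive_log_softmax([1000.0, 0.0]) raises OverflowError;
# stable_log_softmax([1000.0, 0.0]) returns finite log-probabilities.
```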
## Code Evidence
Label smoothing implementation from `examples/NLG/src/model.py:386-395`:
```python
if label_smooth > 0.0001:
    logprobs = torch.nn.functional.log_softmax(lm_logits.view(-1, lm_logits.size(-1)), dim=-1)
    nll_loss = -logprobs.gather(dim=-1, index=lm_labels.view(-1).unsqueeze(1))
    nll_loss = nll_loss.squeeze(1)
    smooth_loss = -logprobs.mean(dim=-1)
    loss = (1.0 - label_smooth) * nll_loss + label_smooth * smooth_loss
    loss = loss.view(_batch, _len)
else:
    loss_fct = nn.CrossEntropyLoss(ignore_index=-1, reduce=False)
    loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1)).view(_batch, _len)
```
Training loop passing label_smooth from `examples/NLG/src/gpt2_ft.py:196-198`:
```python
_lm_logits, _lm_loss = model(
    _input, lm_labels=_target, lm_mask=_msk, label_smooth=args.label_smooth
)
```
Command-line argument from `examples/NLG/src/gpt2_ft.py:83`:
```python
parser.add_argument('--label_smooth', default=0.0, type=float, help='label smoothing')
```