# Heuristic: Microsoft LoRA Label Smoothing NLG
| Knowledge Sources | |
|---|---|
| Domains | Optimization, NLG |
| Last Updated | 2026-02-10 05:30 GMT |
## Overview
Apply label smoothing (typically 0.1) during GPT-2 NLG fine-tuning to prevent overconfident predictions and improve text generation quality.
## Description
The GPT-2 LoRA fine-tuning script supports label smoothing via the `--label_smooth` argument. When enabled (value > 0.0001), instead of standard cross-entropy loss it computes a weighted combination of the negative log-likelihood loss and a smooth loss (the mean negative log-probability over the vocabulary, i.e. the cross-entropy against a uniform target distribution). The formula is `loss = (1 - label_smooth) * nll_loss + label_smooth * smooth_loss`. This prevents the model from becoming overconfident in its predictions, which matters for text generation tasks where output diversity is valued.
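The formula above can be sketched in a few lines of pure Python for a single position over a tiny vocabulary (a minimal illustration of the math only; the repository code operates on batched tensors, and the function name here is mine):

```python
import math

def label_smoothed_loss(logits, target, label_smooth=0.1):
    """Label-smoothed NLL for one prediction over a small vocabulary.

    Implements loss = (1 - eps) * nll_loss + eps * smooth_loss, where
    smooth_loss is the mean negative log-probability across the vocab.
    """
    # Stable log-softmax: subtract the max logit before exponentiating.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    logprobs = [x - log_z for x in logits]

    nll_loss = -logprobs[target]                      # standard cross-entropy term
    smooth_loss = -sum(logprobs) / len(logprobs)      # uniform-target term
    return (1.0 - label_smooth) * nll_loss + label_smooth * smooth_loss

# With label_smooth=0.0 this reduces to plain cross-entropy.
loss = label_smoothed_loss([2.0, 0.5, -1.0], target=0, label_smooth=0.1)
```

Note that when the model is confident in the correct token, the smoothed loss is strictly larger than the plain NLL, which is exactly the penalty on overconfidence.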
## Usage
Use label smoothing when fine-tuning GPT-2 with LoRA for NLG tasks (E2E, DART, WebNLG). Set `--label_smooth 0.1` as the typical value. This is a training-time optimization; it does not affect inference.
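A command along these lines enables it (a sketch only: the data paths, LoRA hyperparameters, and flags other than `--label_smooth` are illustrative values in the style of the repository's E2E example, not a verified invocation):

```shell
python src/gpt2_ft.py \
    --train_data ./data/e2e/train.jsonl \
    --lora_dim 4 \
    --lora_alpha 32 \
    --label_smooth 0.1
```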
## The Insight (Rule of Thumb)
- Action: Set `--label_smooth 0.1` in the GPT-2 fine-tuning command.
- Value: 0.1 is the typical value. The threshold for activation is > 0.0001.
- Trade-off: Training loss converges to a slightly higher floor in exchange for better generalization and more diverse text generation; label smoothing acts as a regularizer.
## Reasoning
In language generation, the model is trained to predict the next token. Without label smoothing, the model is encouraged to assign probability 1.0 to the correct token and 0.0 to all others. This leads to overconfident, repetitive outputs. Label smoothing distributes a small fraction of the probability mass across all vocabulary tokens, encouraging the model to be less certain and more diverse. The smoothed loss is computed explicitly using log_softmax to avoid numerical issues.
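The numerical-stability point can be demonstrated directly: exponentiating large logits before taking the log overflows, while the standard log-softmax trick of subtracting the max logit first stays finite. A self-contained sketch (function names are mine; `torch.nn.functional.log_softmax` does the stable computation internally):

```python
import math

def naive_log_softmax(logits):
    # exp() of a large logit overflows before the log can tame it
    z = sum(math.exp(x) for x in logits)
    return [math.log(math.exp(x) / z) for x in logits]

def stable_log_softmax(logits):
    # subtract the max logit so every exponent is <= 0
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

# naive_log_softmax([1000.0, 0.0]) raises OverflowError;
# stable_log_softmax([1000.0, 0.0]) returns finite log-probabilities.
```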
## Code Evidence
Label smoothing implementation from `examples/NLG/src/model.py:386-395`:
```python
if label_smooth > 0.0001:
    logprobs = torch.nn.functional.log_softmax(lm_logits.view(-1, lm_logits.size(-1)), dim=-1)
    nll_loss = -logprobs.gather(dim=-1, index=lm_labels.view(-1).unsqueeze(1))
    nll_loss = nll_loss.squeeze(1)
    smooth_loss = -logprobs.mean(dim=-1)
    loss = (1.0 - label_smooth) * nll_loss + label_smooth * smooth_loss
    loss = loss.view(_batch, _len)
else:
    loss_fct = nn.CrossEntropyLoss(ignore_index=-1, reduce=False)
    loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1)).view(_batch, _len)
```
Training loop passing label_smooth from `examples/NLG/src/gpt2_ft.py:196-198`:
```python
_lm_logits, _lm_loss = model(
    _input, lm_labels=_target, lm_mask=_msk, label_smooth=args.label_smooth
)
```
Command-line argument from `examples/NLG/src/gpt2_ft.py:83`:
```python
parser.add_argument('--label_smooth', default=0.0, type=float, help='label smoothing')
```